pyspark generate all combinations of unique values












I am trying to generate all combinations of the unique values within my Spark dataframe.
The solution that comes to mind requires itertools.product and a pandas dataframe, and is therefore not efficient enough.
Here is my code:



import itertools
import pandas as pd

# Collect the distinct values of each column to the driver
all_date = [i.Date for i in df.select("Date").distinct().collect()]
all_stores_id = [i.ID for i in df.select("ID").distinct().collect()]
all_category = [i.CATEGORY for i in df.select("CATEGORY").distinct().collect()]
combined = [all_date, all_stores_id, all_category]
# Build every (Date, ID, CATEGORY) combination in pandas
all_combination_pdf = pd.DataFrame(columns=['Date', 'ID', 'CATEGORY'], data=list(itertools.product(*combined)))
# convert pandas dataframe to spark
all_combination_df = sqlContext.createDataFrame(all_combination_pdf)
# Left join so that combinations missing from df are kept (with nulls in df's other columns)
joined = all_combination_df.join(df, ["Date", "ID", "CATEGORY"], how="left")


Is there a way to rewrite this code in a more Spark-idiomatic way?



======EDIT======



I've also tried to implement this using the crossJoin function.
Here is the code:



test_df = ((df.select('Date').distinct()).crossJoin(df.select('ID').distinct())).crossJoin(df.select('CATEGORY').distinct())
test_df.show(10)


which, for some unknown reason, raises the following exception:



An error occurred while calling o305.showString.
: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.Integer.valueOf(Integer.java:832)









  • "some unknown reason" - that error is pretty clear. You're running out of memory. How many distinct values do you have?

    – pault
    Nov 13 '18 at 15:47











  • The word "unknown" was a poor choice. I understand that the error is caused by a memory limit, but I don't know why it happens. The data sample generates about 1M distinct values and, more importantly, the code implemented using pandas works fine. Do you have any idea how to reimplement the pandas code as efficient PySpark code?

    – user1877600
    Nov 13 '18 at 20:09
















pandas pyspark itertools






asked Nov 13 '18 at 10:37, edited Nov 13 '18 at 12:08
user1877600













1 Answer

You can generate the dataframe with the following line. It builds a dataframe of the distinct values of each column and cross joins them (a Cartesian product).



((df.select('Date').distinct()).crossJoin(df.select('ID').distinct())).crossJoin(df.select('CATEGORY').distinct())


With a little work, this can be put inside a for loop to automate it for other dataframes.
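One possible way to generalize the line above is sketched here. The helper name all_combinations and its columns parameter are hypothetical (they are not part of the original answer), and functools.reduce stands in for the explicit for loop. The function only relies on the select(), distinct() and crossJoin() methods of the dataframe it is given.

```python
from functools import reduce


def all_combinations(df, columns):
    """Return every combination of the distinct values of `columns`.

    `df` is assumed to be a pyspark.sql.DataFrame; only its select(),
    distinct() and crossJoin() methods are used.
    """
    if not columns:
        raise ValueError("columns must contain at least one column name")
    # One single-column dataframe of distinct values per requested column.
    distinct_frames = [df.select(c).distinct() for c in columns]
    # Fold the frames together with successive cross joins (Cartesian product).
    return reduce(lambda left, right: left.crossJoin(right), distinct_frames)


# Hypothetical usage, assuming `df` has the columns from the question:
# grid = all_combinations(df, ["Date", "ID", "CATEGORY"])
# joined = grid.join(df, ["Date", "ID", "CATEGORY"], how="left")
```

This produces the same query as the hand-written chain, so it does not by itself reduce the size of the result; the number of rows is still the product of the distinct counts.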



Hope this helps






  • Thank you for your answer. Unfortunately, for some reason unknown to me, I am not able to execute your line followed by test_df.show(10). The error that I get is java.lang.OutOfMemoryError: GC overhead limit exceeded. The project is developed on the Azure platform, so this is definitely not a hardware problem.

    – user1877600
    Nov 13 '18 at 11:45











  • Sorry it didn't help. A memory error in Azure is certainly strange, but think about the huge dataframe that is going to be created. If you have, for example, 3 columns, each with 5 distinct values, you will end up with 5^3 = 125 rows. Imagine it with bigger values.

    – Manrique
    Nov 13 '18 at 12:13
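The arithmetic in the comment above can be checked before running the cross join. This is a minimal sketch: the helper combination_count is hypothetical, and the distinct counts are passed in as plain integers rather than read from a dataframe.

```python
from math import prod  # available in Python 3.8+


def combination_count(distinct_counts):
    """Number of rows in the Cartesian product of columns with these distinct counts."""
    return prod(distinct_counts)


# Three columns with 5 distinct values each, as in the comment above.
print(combination_count([5, 5, 5]))  # 125

# The counts themselves could come from Spark, for example (assumption, not run here):
# counts = [df.select(c).distinct().count() for c in ["Date", "ID", "CATEGORY"]]
```

Comparing this number with the available memory gives a rough idea of whether the full grid is practical to build.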











answered Nov 13 '18 at 11:20
Manrique












