Get the max value for each key in a Spark RDD
What is the best way to return the max row (value) associated with each unique key in a spark RDD?
I'm using python and I've tried Math max, mapping and reducing by keys and aggregates. Is there an efficient way to do this? Possibly an UDF?
I have in RDD format:
[(v, 3),
(v, 1),
(v, 1),
(w, 7),
(w, 1),
(x, 3),
(y, 1),
(y, 1),
(y, 2),
(y, 3)]
And I need to return:
[(v, 3),
(w, 7),
(x, 3),
(y, 3)]
Ties can return the first value or random.
python apache-spark pyspark rdd
add a comment |
What is the best way to return the max row (value) associated with each unique key in a spark RDD?
I'm using python and I've tried Math max, mapping and reducing by keys and aggregates. Is there an efficient way to do this? Possibly an UDF?
I have in RDD format:
[(v, 3),
(v, 1),
(v, 1),
(w, 7),
(w, 1),
(x, 3),
(y, 1),
(y, 1),
(y, 2),
(y, 3)]
And I need to return:
[(v, 3),
(w, 7),
(x, 3),
(y, 3)]
Ties can return the first value or random.
python apache-spark pyspark rdd
add a comment |
What is the best way to return the max row (value) associated with each unique key in a spark RDD?
I'm using python and I've tried Math max, mapping and reducing by keys and aggregates. Is there an efficient way to do this? Possibly an UDF?
I have in RDD format:
[(v, 3),
(v, 1),
(v, 1),
(w, 7),
(w, 1),
(x, 3),
(y, 1),
(y, 1),
(y, 2),
(y, 3)]
And I need to return:
[(v, 3),
(w, 7),
(x, 3),
(y, 3)]
Ties can return the first value or random.
python apache-spark pyspark rdd
What is the best way to return the max row (value) associated with each unique key in a spark RDD?
I'm using python and I've tried Math max, mapping and reducing by keys and aggregates. Is there an efficient way to do this? Possibly an UDF?
I have in RDD format:
[(v, 3),
(v, 1),
(v, 1),
(w, 7),
(w, 1),
(x, 3),
(y, 1),
(y, 1),
(y, 2),
(y, 3)]
And I need to return:
[(v, 3),
(w, 7),
(x, 3),
(y, 3)]
Ties can return the first value or random.
python apache-spark pyspark rdd
python apache-spark pyspark rdd
edited Mar 14 '17 at 12:21
SiHa
3,36161733
3,36161733
asked May 4 '16 at 0:17
captainKirk104captainKirk104
4725
4725
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
Actually you have a PairRDD. One of the best ways to do it is with reduceByKey:
(Scala)
val grouped = rdd.reduceByKey(math.max(_, _))
(Python)
grouped = rdd.reduceByKey(max)
(Java 7)
JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
new Function2<Integer, Integer, Integer>() {
public Integer call(Integer v1, Integer v2) {
return Math.max(v1, v2);
}
});
(Java 8)
JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
(v1, v2) -> Math.max(v1, v2)
);
API doc for reduceByKey:
- Scala
- Python
- Java
can you give a way to do this in Java as well? I am using java and looking for exactly the same thing
– tsar2512
Jan 24 '17 at 22:47
@tsar2512 With Java 8, this might work:new JavaPairRDD(rdd).reduceByKey((v1, v2) -> Math.max(v1, v2));
– Daniel de Paula
Jan 25 '17 at 9:12
thanks for the response, unfortunately, I am using Java 7 - it does not allow lambda functions. One typically has to write anonymous functions. Could you let me know what would be the solution in Java 7? I suspext a simple comparator function should work!
– tsar2512
Jan 25 '17 at 9:55
Additionally. What we are getting is the max of values which belong to each key. Is that correct?
– tsar2512
Jan 25 '17 at 9:56
@tsar2512, yes, the resulting RDD will contain a single entry for each key, containing a pair (key, maxValue). I updated the answer with versions for Java 7 and Java 8, but I haven't tested them, so please let me know if it works.
– Daniel de Paula
Jan 25 '17 at 10:10
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f37016427%2fget-the-max-value-for-each-key-in-a-spark-rdd%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
Actually you have a PairRDD. One of the best ways to do it is with reduceByKey:
(Scala)
val grouped = rdd.reduceByKey(math.max(_, _))
(Python)
grouped = rdd.reduceByKey(max)
(Java 7)
JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
new Function2<Integer, Integer, Integer>() {
public Integer call(Integer v1, Integer v2) {
return Math.max(v1, v2);
}
});
(Java 8)
JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
(v1, v2) -> Math.max(v1, v2)
);
API doc for reduceByKey:
- Scala
- Python
- Java
can you give a way to do this in Java as well? I am using java and looking for exactly the same thing
– tsar2512
Jan 24 '17 at 22:47
@tsar2512 With Java 8, this might work:new JavaPairRDD(rdd).reduceByKey((v1, v2) -> Math.max(v1, v2));
– Daniel de Paula
Jan 25 '17 at 9:12
thanks for the response, unfortunately, I am using Java 7 - it does not allow lambda functions. One typically has to write anonymous functions. Could you let me know what would be the solution in Java 7? I suspext a simple comparator function should work!
– tsar2512
Jan 25 '17 at 9:55
Additionally. What we are getting is the max of values which belong to each key. Is that correct?
– tsar2512
Jan 25 '17 at 9:56
@tsar2512, yes, the resulting RDD will contain a single entry for each key, containing a pair (key, maxValue). I updated the answer with versions for Java 7 and Java 8, but I haven't tested them, so please let me know if it works.
– Daniel de Paula
Jan 25 '17 at 10:10
add a comment |
Actually you have a PairRDD. One of the best ways to do it is with reduceByKey:
(Scala)
val grouped = rdd.reduceByKey(math.max(_, _))
(Python)
grouped = rdd.reduceByKey(max)
(Java 7)
JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
new Function2<Integer, Integer, Integer>() {
public Integer call(Integer v1, Integer v2) {
return Math.max(v1, v2);
}
});
(Java 8)
JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
(v1, v2) -> Math.max(v1, v2)
);
API doc for reduceByKey:
- Scala
- Python
- Java
can you give a way to do this in Java as well? I am using java and looking for exactly the same thing
– tsar2512
Jan 24 '17 at 22:47
@tsar2512 With Java 8, this might work:new JavaPairRDD(rdd).reduceByKey((v1, v2) -> Math.max(v1, v2));
– Daniel de Paula
Jan 25 '17 at 9:12
thanks for the response, unfortunately, I am using Java 7 - it does not allow lambda functions. One typically has to write anonymous functions. Could you let me know what would be the solution in Java 7? I suspext a simple comparator function should work!
– tsar2512
Jan 25 '17 at 9:55
Additionally. What we are getting is the max of values which belong to each key. Is that correct?
– tsar2512
Jan 25 '17 at 9:56
@tsar2512, yes, the resulting RDD will contain a single entry for each key, containing a pair (key, maxValue). I updated the answer with versions for Java 7 and Java 8, but I haven't tested them, so please let me know if it works.
– Daniel de Paula
Jan 25 '17 at 10:10
add a comment |
Actually you have a PairRDD. One of the best ways to do it is with reduceByKey:
(Scala)
val grouped = rdd.reduceByKey(math.max(_, _))
(Python)
grouped = rdd.reduceByKey(max)
(Java 7)
JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
new Function2<Integer, Integer, Integer>() {
public Integer call(Integer v1, Integer v2) {
return Math.max(v1, v2);
}
});
(Java 8)
JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
(v1, v2) -> Math.max(v1, v2)
);
API doc for reduceByKey:
- Scala
- Python
- Java
Actually you have a PairRDD. One of the best ways to do it is with reduceByKey:
(Scala)
val grouped = rdd.reduceByKey(math.max(_, _))
(Python)
grouped = rdd.reduceByKey(max)
(Java 7)
JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
new Function2<Integer, Integer, Integer>() {
public Integer call(Integer v1, Integer v2) {
return Math.max(v1, v2);
}
});
(Java 8)
JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
(v1, v2) -> Math.max(v1, v2)
);
API doc for reduceByKey:
- Scala
- Python
- Java
edited Jan 25 '17 at 10:07
answered May 4 '16 at 0:29
Daniel de PaulaDaniel de Paula
9,76954360
9,76954360
can you give a way to do this in Java as well? I am using java and looking for exactly the same thing
– tsar2512
Jan 24 '17 at 22:47
@tsar2512 With Java 8, this might work:new JavaPairRDD(rdd).reduceByKey((v1, v2) -> Math.max(v1, v2));
– Daniel de Paula
Jan 25 '17 at 9:12
thanks for the response, unfortunately, I am using Java 7 - it does not allow lambda functions. One typically has to write anonymous functions. Could you let me know what would be the solution in Java 7? I suspext a simple comparator function should work!
– tsar2512
Jan 25 '17 at 9:55
Additionally. What we are getting is the max of values which belong to each key. Is that correct?
– tsar2512
Jan 25 '17 at 9:56
@tsar2512, yes, the resulting RDD will contain a single entry for each key, containing a pair (key, maxValue). I updated the answer with versions for Java 7 and Java 8, but I haven't tested them, so please let me know if it works.
– Daniel de Paula
Jan 25 '17 at 10:10
add a comment |
can you give a way to do this in Java as well? I am using java and looking for exactly the same thing
– tsar2512
Jan 24 '17 at 22:47
@tsar2512 With Java 8, this might work:new JavaPairRDD(rdd).reduceByKey((v1, v2) -> Math.max(v1, v2));
– Daniel de Paula
Jan 25 '17 at 9:12
thanks for the response, unfortunately, I am using Java 7 - it does not allow lambda functions. One typically has to write anonymous functions. Could you let me know what would be the solution in Java 7? I suspext a simple comparator function should work!
– tsar2512
Jan 25 '17 at 9:55
Additionally. What we are getting is the max of values which belong to each key. Is that correct?
– tsar2512
Jan 25 '17 at 9:56
@tsar2512, yes, the resulting RDD will contain a single entry for each key, containing a pair (key, maxValue). I updated the answer with versions for Java 7 and Java 8, but I haven't tested them, so please let me know if it works.
– Daniel de Paula
Jan 25 '17 at 10:10
can you give a way to do this in Java as well? I am using java and looking for exactly the same thing
– tsar2512
Jan 24 '17 at 22:47
can you give a way to do this in Java as well? I am using java and looking for exactly the same thing
– tsar2512
Jan 24 '17 at 22:47
@tsar2512 With Java 8, this might work:
new JavaPairRDD(rdd).reduceByKey((v1, v2) -> Math.max(v1, v2));
– Daniel de Paula
Jan 25 '17 at 9:12
@tsar2512 With Java 8, this might work:
new JavaPairRDD(rdd).reduceByKey((v1, v2) -> Math.max(v1, v2));
– Daniel de Paula
Jan 25 '17 at 9:12
thanks for the response, unfortunately, I am using Java 7 - it does not allow lambda functions. One typically has to write anonymous functions. Could you let me know what would be the solution in Java 7? I suspext a simple comparator function should work!
– tsar2512
Jan 25 '17 at 9:55
thanks for the response, unfortunately, I am using Java 7 - it does not allow lambda functions. One typically has to write anonymous functions. Could you let me know what would be the solution in Java 7? I suspext a simple comparator function should work!
– tsar2512
Jan 25 '17 at 9:55
Additionally. What we are getting is the max of values which belong to each key. Is that correct?
– tsar2512
Jan 25 '17 at 9:56
Additionally. What we are getting is the max of values which belong to each key. Is that correct?
– tsar2512
Jan 25 '17 at 9:56
@tsar2512, yes, the resulting RDD will contain a single entry for each key, containing a pair (key, maxValue). I updated the answer with versions for Java 7 and Java 8, but I haven't tested them, so please let me know if it works.
– Daniel de Paula
Jan 25 '17 at 10:10
@tsar2512, yes, the resulting RDD will contain a single entry for each key, containing a pair (key, maxValue). I updated the answer with versions for Java 7 and Java 8, but I haven't tested them, so please let me know if it works.
– Daniel de Paula
Jan 25 '17 at 10:10
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f37016427%2fget-the-max-value-for-each-key-in-a-spark-rdd%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown