Get the max value for each key in a Spark RDD












7















What is the best way to return the max row (value) associated with each unique key in a spark RDD?



I'm using python and I've tried Math max, mapping and reducing by keys and aggregates. Is there an efficient way to do this? Possibly an UDF?



I have in RDD format:



[(v, 3),
(v, 1),
(v, 1),
(w, 7),
(w, 1),
(x, 3),
(y, 1),
(y, 1),
(y, 2),
(y, 3)]


And I need to return:



[(v, 3),
(w, 7),
(x, 3),
(y, 3)]


Ties can return the first value or random.










share|improve this question





























    7















    What is the best way to return the max row (value) associated with each unique key in a spark RDD?



    I'm using python and I've tried Math max, mapping and reducing by keys and aggregates. Is there an efficient way to do this? Possibly an UDF?



    I have in RDD format:



    [(v, 3),
    (v, 1),
    (v, 1),
    (w, 7),
    (w, 1),
    (x, 3),
    (y, 1),
    (y, 1),
    (y, 2),
    (y, 3)]


    And I need to return:



    [(v, 3),
    (w, 7),
    (x, 3),
    (y, 3)]


    Ties can return the first value or random.










    share|improve this question



























      7












      7








      7


      3






      What is the best way to return the max row (value) associated with each unique key in a spark RDD?



      I'm using python and I've tried Math max, mapping and reducing by keys and aggregates. Is there an efficient way to do this? Possibly an UDF?



      I have in RDD format:



      [(v, 3),
      (v, 1),
      (v, 1),
      (w, 7),
      (w, 1),
      (x, 3),
      (y, 1),
      (y, 1),
      (y, 2),
      (y, 3)]


      And I need to return:



      [(v, 3),
      (w, 7),
      (x, 3),
      (y, 3)]


      Ties can return the first value or random.










      share|improve this question
















      What is the best way to return the max row (value) associated with each unique key in a spark RDD?



      I'm using python and I've tried Math max, mapping and reducing by keys and aggregates. Is there an efficient way to do this? Possibly an UDF?



      I have in RDD format:



      [(v, 3),
      (v, 1),
      (v, 1),
      (w, 7),
      (w, 1),
      (x, 3),
      (y, 1),
      (y, 1),
      (y, 2),
      (y, 3)]


      And I need to return:



      [(v, 3),
      (w, 7),
      (x, 3),
      (y, 3)]


      Ties can return the first value or random.







      python apache-spark pyspark rdd






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Mar 14 '17 at 12:21









      SiHa

      3,36161733




      3,36161733










      asked May 4 '16 at 0:17









      captainKirk104captainKirk104

      4725




      4725
























          1 Answer
          1






          active

          oldest

          votes


















          15














          Actually you have a PairRDD. One of the best ways to do it is with reduceByKey:



          (Scala)



          val grouped = rdd.reduceByKey(math.max(_, _))


          (Python)



          grouped = rdd.reduceByKey(max)


          (Java 7)



          JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
          new Function2<Integer, Integer, Integer>() {
          public Integer call(Integer v1, Integer v2) {
          return Math.max(v1, v2);
          }
          });


          (Java 8)



          JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
          (v1, v2) -> Math.max(v1, v2)
          );


          API doc for reduceByKey:




          • Scala

          • Python

          • Java






          share|improve this answer


























          • can you give a way to do this in Java as well? I am using java and looking for exactly the same thing

            – tsar2512
            Jan 24 '17 at 22:47











          • @tsar2512 With Java 8, this might work: new JavaPairRDD(rdd).reduceByKey((v1, v2) -> Math.max(v1, v2));

            – Daniel de Paula
            Jan 25 '17 at 9:12













          • thanks for the response, unfortunately, I am using Java 7 - it does not allow lambda functions. One typically has to write anonymous functions. Could you let me know what would be the solution in Java 7? I suspext a simple comparator function should work!

            – tsar2512
            Jan 25 '17 at 9:55











          • Additionally. What we are getting is the max of values which belong to each key. Is that correct?

            – tsar2512
            Jan 25 '17 at 9:56











          • @tsar2512, yes, the resulting RDD will contain a single entry for each key, containing a pair (key, maxValue). I updated the answer with versions for Java 7 and Java 8, but I haven't tested them, so please let me know if it works.

            – Daniel de Paula
            Jan 25 '17 at 10:10











          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f37016427%2fget-the-max-value-for-each-key-in-a-spark-rdd%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          15














          Actually you have a PairRDD. One of the best ways to do it is with reduceByKey:



          (Scala)



          val grouped = rdd.reduceByKey(math.max(_, _))


          (Python)



          grouped = rdd.reduceByKey(max)


          (Java 7)



          JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
          new Function2<Integer, Integer, Integer>() {
          public Integer call(Integer v1, Integer v2) {
          return Math.max(v1, v2);
          }
          });


          (Java 8)



          JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
          (v1, v2) -> Math.max(v1, v2)
          );


          API doc for reduceByKey:




          • Scala

          • Python

          • Java






          share|improve this answer


























          • can you give a way to do this in Java as well? I am using java and looking for exactly the same thing

            – tsar2512
            Jan 24 '17 at 22:47











          • @tsar2512 With Java 8, this might work: new JavaPairRDD(rdd).reduceByKey((v1, v2) -> Math.max(v1, v2));

            – Daniel de Paula
            Jan 25 '17 at 9:12













          • thanks for the response, unfortunately, I am using Java 7 - it does not allow lambda functions. One typically has to write anonymous functions. Could you let me know what would be the solution in Java 7? I suspext a simple comparator function should work!

            – tsar2512
            Jan 25 '17 at 9:55











          • Additionally. What we are getting is the max of values which belong to each key. Is that correct?

            – tsar2512
            Jan 25 '17 at 9:56











          • @tsar2512, yes, the resulting RDD will contain a single entry for each key, containing a pair (key, maxValue). I updated the answer with versions for Java 7 and Java 8, but I haven't tested them, so please let me know if it works.

            – Daniel de Paula
            Jan 25 '17 at 10:10
















          15














          Actually you have a PairRDD. One of the best ways to do it is with reduceByKey:



          (Scala)



          val grouped = rdd.reduceByKey(math.max(_, _))


          (Python)



          grouped = rdd.reduceByKey(max)


          (Java 7)



          JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
          new Function2<Integer, Integer, Integer>() {
          public Integer call(Integer v1, Integer v2) {
          return Math.max(v1, v2);
          }
          });


          (Java 8)



          JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
          (v1, v2) -> Math.max(v1, v2)
          );


          API doc for reduceByKey:




          • Scala

          • Python

          • Java






          share|improve this answer


























          • can you give a way to do this in Java as well? I am using java and looking for exactly the same thing

            – tsar2512
            Jan 24 '17 at 22:47











          • @tsar2512 With Java 8, this might work: new JavaPairRDD(rdd).reduceByKey((v1, v2) -> Math.max(v1, v2));

            – Daniel de Paula
            Jan 25 '17 at 9:12













          • thanks for the response, unfortunately, I am using Java 7 - it does not allow lambda functions. One typically has to write anonymous functions. Could you let me know what would be the solution in Java 7? I suspext a simple comparator function should work!

            – tsar2512
            Jan 25 '17 at 9:55











          • Additionally. What we are getting is the max of values which belong to each key. Is that correct?

            – tsar2512
            Jan 25 '17 at 9:56











          • @tsar2512, yes, the resulting RDD will contain a single entry for each key, containing a pair (key, maxValue). I updated the answer with versions for Java 7 and Java 8, but I haven't tested them, so please let me know if it works.

            – Daniel de Paula
            Jan 25 '17 at 10:10














          15












          15








          15







          Actually you have a PairRDD. One of the best ways to do it is with reduceByKey:



          (Scala)



          val grouped = rdd.reduceByKey(math.max(_, _))


          (Python)



          grouped = rdd.reduceByKey(max)


          (Java 7)



          JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
          new Function2<Integer, Integer, Integer>() {
          public Integer call(Integer v1, Integer v2) {
          return Math.max(v1, v2);
          }
          });


          (Java 8)



          JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
          (v1, v2) -> Math.max(v1, v2)
          );


          API doc for reduceByKey:




          • Scala

          • Python

          • Java






          share|improve this answer















          Actually you have a PairRDD. One of the best ways to do it is with reduceByKey:



          (Scala)



          val grouped = rdd.reduceByKey(math.max(_, _))


          (Python)



          grouped = rdd.reduceByKey(max)


          (Java 7)



          JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
          new Function2<Integer, Integer, Integer>() {
          public Integer call(Integer v1, Integer v2) {
          return Math.max(v1, v2);
          }
          });


          (Java 8)



          JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
          (v1, v2) -> Math.max(v1, v2)
          );


          API doc for reduceByKey:




          • Scala

          • Python

          • Java







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Jan 25 '17 at 10:07

























          answered May 4 '16 at 0:29









          Daniel de PaulaDaniel de Paula

          9,76954360




          9,76954360













          • can you give a way to do this in Java as well? I am using java and looking for exactly the same thing

            – tsar2512
            Jan 24 '17 at 22:47











          • @tsar2512 With Java 8, this might work: new JavaPairRDD(rdd).reduceByKey((v1, v2) -> Math.max(v1, v2));

            – Daniel de Paula
            Jan 25 '17 at 9:12













          • thanks for the response, unfortunately, I am using Java 7 - it does not allow lambda functions. One typically has to write anonymous functions. Could you let me know what would be the solution in Java 7? I suspext a simple comparator function should work!

            – tsar2512
            Jan 25 '17 at 9:55











          • Additionally. What we are getting is the max of values which belong to each key. Is that correct?

            – tsar2512
            Jan 25 '17 at 9:56











          • @tsar2512, yes, the resulting RDD will contain a single entry for each key, containing a pair (key, maxValue). I updated the answer with versions for Java 7 and Java 8, but I haven't tested them, so please let me know if it works.

            – Daniel de Paula
            Jan 25 '17 at 10:10



















          • can you give a way to do this in Java as well? I am using java and looking for exactly the same thing

            – tsar2512
            Jan 24 '17 at 22:47











          • @tsar2512 With Java 8, this might work: new JavaPairRDD(rdd).reduceByKey((v1, v2) -> Math.max(v1, v2));

            – Daniel de Paula
            Jan 25 '17 at 9:12













          • thanks for the response, unfortunately, I am using Java 7 - it does not allow lambda functions. One typically has to write anonymous functions. Could you let me know what would be the solution in Java 7? I suspext a simple comparator function should work!

            – tsar2512
            Jan 25 '17 at 9:55











          • Additionally. What we are getting is the max of values which belong to each key. Is that correct?

            – tsar2512
            Jan 25 '17 at 9:56











          • @tsar2512, yes, the resulting RDD will contain a single entry for each key, containing a pair (key, maxValue). I updated the answer with versions for Java 7 and Java 8, but I haven't tested them, so please let me know if it works.

            – Daniel de Paula
            Jan 25 '17 at 10:10

















          can you give a way to do this in Java as well? I am using java and looking for exactly the same thing

          – tsar2512
          Jan 24 '17 at 22:47





          can you give a way to do this in Java as well? I am using java and looking for exactly the same thing

          – tsar2512
          Jan 24 '17 at 22:47













          @tsar2512 With Java 8, this might work: new JavaPairRDD(rdd).reduceByKey((v1, v2) -> Math.max(v1, v2));

          – Daniel de Paula
          Jan 25 '17 at 9:12







          @tsar2512 With Java 8, this might work: new JavaPairRDD(rdd).reduceByKey((v1, v2) -> Math.max(v1, v2));

          – Daniel de Paula
          Jan 25 '17 at 9:12















          thanks for the response, unfortunately, I am using Java 7 - it does not allow lambda functions. One typically has to write anonymous functions. Could you let me know what would be the solution in Java 7? I suspext a simple comparator function should work!

          – tsar2512
          Jan 25 '17 at 9:55





          thanks for the response, unfortunately, I am using Java 7 - it does not allow lambda functions. One typically has to write anonymous functions. Could you let me know what would be the solution in Java 7? I suspext a simple comparator function should work!

          – tsar2512
          Jan 25 '17 at 9:55













          Additionally. What we are getting is the max of values which belong to each key. Is that correct?

          – tsar2512
          Jan 25 '17 at 9:56





          Additionally. What we are getting is the max of values which belong to each key. Is that correct?

          – tsar2512
          Jan 25 '17 at 9:56













          @tsar2512, yes, the resulting RDD will contain a single entry for each key, containing a pair (key, maxValue). I updated the answer with versions for Java 7 and Java 8, but I haven't tested them, so please let me know if it works.

          – Daniel de Paula
          Jan 25 '17 at 10:10





          @tsar2512, yes, the resulting RDD will contain a single entry for each key, containing a pair (key, maxValue). I updated the answer with versions for Java 7 and Java 8, but I haven't tested them, so please let me know if it works.

          – Daniel de Paula
          Jan 25 '17 at 10:10




















          draft saved

          draft discarded




















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f37016427%2fget-the-max-value-for-each-key-in-a-spark-rdd%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          Florida Star v. B. J. F.

          Error while running script in elastic search , gateway timeout

          Retrieve a Users Dashboard in Tumblr with R and TumblR. Oauth Issues