Need to find the latest records for composite keys in Spark SQL
I need to find the latest records for each full_national_number based on the date column. Can someone please suggest a solution?



My data is



+--------------------+-----------------------+----------+
|full_national_number|derived_sequence_number|        ts|
+--------------------+-----------------------+----------+
|           A00000001|                   0000|1111-11-11|
|           A00000001|                   0001|1111-11-11|
|           A00000001|                   0002|1111-11-11|
|           A00000002|                   0000|1111-11-11|
|           A00000002|                   0001|1111-11-11|
|           A00000002|                   0002|1111-11-11|
|           A00000003|                   0000|1111-11-11|
|           A00000003|                   0001|1111-11-11|
|           A00000004|                   0000|1111-11-11|
|          A000000010|                   0000|1111-11-11|
|          A000000011|                   0000|1111-11-11|
|           A00000008|                   0000|2018-11-16|
|           A00000008|                   0001|2018-11-16|
|           A00000008|                   0002|2018-11-16|
|           A00000002|                   0000|2018-11-16|
|           A00000003|                   0000|2018-11-16|
|           A00000004|                   0000|2018-11-16|
|           A00000005|                   0000|2018-11-16|
+--------------------+-----------------------+----------+



My expected output should be






+--------------------+-----------------------+----------+
|full_national_number|derived_sequence_number|        ts|
+--------------------+-----------------------+----------+
|           A00000001|                   0000|1111-11-11|
|           A00000001|                   0001|1111-11-11|
|           A00000001|                   0002|1111-11-11|
|           A00000002|                   0000|2018-11-16|
|           A00000003|                   0000|2018-11-16|
|           A00000004|                   0000|2018-11-16|
|           A00000005|                   0000|2018-11-16|
|           A00000008|                   0000|2018-11-16|
|           A00000008|                   0001|2018-11-16|
|           A00000008|                   0002|2018-11-16|
|          A000000010|                   0000|1111-11-11|
|          A000000011|                   0000|1111-11-11|
+--------------------+-----------------------+----------+





I tried the query below but got an error.




sqlContext.sql("""
    SELECT full_national_number, derived_sequence_number, max(ts)
    FROM (
        SELECT *, to_date('1111-11-11') AS ts FROM t1
        UNION ALL
        SELECT *, current_date FROM t2
    ) unioned
    GROUP BY full_national_number
""").show()




The error I get is




Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/cloudera/parcels/CDH-5.14.4-1.cdh5.14.4.p0.3/lib/spark/python/pyspark/sql/context.py", line 580, in sql
    return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
  File "/opt/cloudera/parcels/CDH-5.14.4-1.cdh5.14.4.p0.3/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/cloudera/parcels/CDH-5.14.4-1.cdh5.14.4.p0.3/lib/spark/python/pyspark/sql/utils.py", line 51, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"expression 'derived_sequence_number' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;"




Please suggest a solution.
Tags: python, sql, pyspark, pyspark-sql






edited Nov 16 '18 at 5:30 by Aqueous Carlos
asked Nov 16 '18 at 3:55 by Abhishek Allamsetty
1 Answer
          I think this will get you the results you want. Just pasting the SQL query:



SELECT full_national_number, derived_sequence_number, ts
FROM (
    SELECT full_national_number, derived_sequence_number, ts,
           RANK() OVER (PARTITION BY full_national_number ORDER BY ts DESC) AS rnk
    FROM table
) a
WHERE a.rnk = 1;


          Let me know if this helps.
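One thing worth noting about the RANK() approach: all rows tied on a key's maximum ts get rank 1, so every derived_sequence_number for the latest date survives, which matches the expected output above. A quick way to sanity-check the query shape without a Spark cluster is SQLite's window functions (requires SQLite 3.25+, bundled with recent Python); the table name t and the sample rows here are made up for illustration, standing in for the question's unioned t1/t2:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (full_national_number TEXT, derived_sequence_number TEXT, ts TEXT);
INSERT INTO t VALUES
  ('A00000001','0000','1111-11-11'),
  ('A00000001','0001','1111-11-11'),
  ('A00000002','0000','1111-11-11'),
  ('A00000002','0000','2018-11-16');
""")

# Rank rows within each key by date descending, then keep rank 1.
# ISO-format date strings sort correctly as text.
rows = conn.execute("""
SELECT full_national_number, derived_sequence_number, ts
FROM (
  SELECT full_national_number, derived_sequence_number, ts,
         RANK() OVER (PARTITION BY full_national_number ORDER BY ts DESC) AS rnk
  FROM t
) a
WHERE a.rnk = 1
ORDER BY full_national_number, derived_sequence_number
""").fetchall()
```

Both 'A00000001' rows are kept (they tie on the maximum date), while the older 'A00000002' row is dropped. ROW_NUMBER() would instead keep only one row per key, which is not what the expected output asks for.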






• Awesome Mayank! It worked just as expected. Thank you so much, it really helped me. – Abhishek Allamsetty, Nov 16 '18 at 5:14










• @AbhishekAllamsetty Since the answer helped you, please accept it. Read the link if you have issues, or let me know. – Mayank Porwal, Nov 16 '18 at 5:49











• Sorry Mayank, I did it now. And thank you so much for your lightning response, which made my day. :) – Abhishek Allamsetty, Nov 16 '18 at 15:02












answered Nov 16 '18 at 4:08 by Mayank Porwal












