Need to find the latest records for composite keys in Spark SQL
I need to find the latest records for each full_national_number based on the date column. Can someone please suggest a solution?



My data is



+--------------------+-----------------------+----------+
|full_national_number|derived_sequence_number|        ts|
+--------------------+-----------------------+----------+
|           A00000001|                   0000|1111-11-11|
|           A00000001|                   0001|1111-11-11|
|           A00000001|                   0002|1111-11-11|
|           A00000002|                   0000|1111-11-11|
|           A00000002|                   0001|1111-11-11|
|           A00000002|                   0002|1111-11-11|
|           A00000003|                   0000|1111-11-11|
|           A00000003|                   0001|1111-11-11|
|           A00000004|                   0000|1111-11-11|
|          A000000010|                   0000|1111-11-11|
|          A000000011|                   0000|1111-11-11|
|           A00000008|                   0000|2018-11-16|
|           A00000008|                   0001|2018-11-16|
|           A00000008|                   0002|2018-11-16|
|           A00000002|                   0000|2018-11-16|
|           A00000003|                   0000|2018-11-16|
|           A00000004|                   0000|2018-11-16|
|           A00000005|                   0000|2018-11-16|
+--------------------+-----------------------+----------+



My expected output should be






+--------------------+-----------------------+----------+
|full_national_number|derived_sequence_number|        ts|
+--------------------+-----------------------+----------+
|           A00000001|                   0000|1111-11-11|
|           A00000001|                   0001|1111-11-11|
|           A00000001|                   0002|1111-11-11|
|           A00000002|                   0000|2018-11-16|
|           A00000003|                   0000|2018-11-16|
|           A00000004|                   0000|2018-11-16|
|           A00000005|                   0000|2018-11-16|
|           A00000008|                   0000|2018-11-16|
|           A00000008|                   0001|2018-11-16|
|           A00000008|                   0002|2018-11-16|
|          A000000010|                   0000|1111-11-11|
|          A000000011|                   0000|1111-11-11|
+--------------------+-----------------------+----------+





I tried the query below but got an error.




sqlContext.sql("""
    SELECT full_national_number, derived_sequence_number, max(ts)
    FROM (
        SELECT *, to_date('1111-11-11') AS ts FROM t1
        UNION ALL
        SELECT *, current_date FROM t2
    ) unioned
    GROUP BY full_national_number
""").show()




The error I get is




Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/cloudera/parcels/CDH-5.14.4-1.cdh5.14.4.p0.3/lib/spark/python/pyspark/sql/context.py", line 580, in sql
    return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
  File "/opt/cloudera/parcels/CDH-5.14.4-1.cdh5.14.4.p0.3/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/cloudera/parcels/CDH-5.14.4-1.cdh5.14.4.p0.3/lib/spark/python/pyspark/sql/utils.py", line 51, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"expression 'derived_sequence_number' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;"




Please suggest a solution.
Tags: python, sql, pyspark, pyspark-sql






edited Nov 16 '18 at 5:30 by Aqueous Carlos
asked Nov 16 '18 at 3:55 by Abhishek Allamsetty
1 Answer
          I think this will get you the results you want. Just pasting the SQL query:



SELECT full_national_number, derived_sequence_number, ts
FROM (
    SELECT full_national_number, derived_sequence_number, ts,
           RANK() OVER (PARTITION BY full_national_number ORDER BY ts DESC) AS rnk
    FROM table
) a
WHERE a.rnk = 1;


          Let me know if this helps.
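One thing worth noting about the RANK() approach: all rows tied on a key's maximum ts get rank 1, so every derived_sequence_number for the latest date survives, which matches the expected output above. A quick way to sanity-check the query shape without a Spark cluster is SQLite's window functions (requires SQLite 3.25+, bundled with recent Python); the table name t and the sample rows here are made up for illustration, standing in for the question's unioned t1/t2:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (full_national_number TEXT, derived_sequence_number TEXT, ts TEXT);
INSERT INTO t VALUES
  ('A00000001','0000','1111-11-11'),
  ('A00000001','0001','1111-11-11'),
  ('A00000002','0000','1111-11-11'),
  ('A00000002','0000','2018-11-16');
""")

# Rank rows within each key by date descending, then keep rank 1.
# ISO-format date strings sort correctly as text.
rows = conn.execute("""
SELECT full_national_number, derived_sequence_number, ts
FROM (
  SELECT full_national_number, derived_sequence_number, ts,
         RANK() OVER (PARTITION BY full_national_number ORDER BY ts DESC) AS rnk
  FROM t
) a
WHERE a.rnk = 1
ORDER BY full_national_number, derived_sequence_number
""").fetchall()
```

Both 'A00000001' rows are kept (they tie on the maximum date), while the older 'A00000002' row is dropped. ROW_NUMBER() would instead keep only one row per key, which is not what the expected output asks for.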






• Awesome Mayank! It worked just as expected. Thank you so much, it really helped me. – Abhishek Allamsetty, Nov 16 '18 at 5:14










• @AbhishekAllamsetty Since the answer helped you, please accept it. Read the link if you have issues, or let me know. – Mayank Porwal, Nov 16 '18 at 5:49











• Sorry Mayank, I did it now. And thank you so much for your lightning response, which made my day. :) – Abhishek Allamsetty, Nov 16 '18 at 15:02












answered Nov 16 '18 at 4:08 by Mayank Porwal












