Need to find the latest records for composite keys in Spark SQL
I need to find the latest records per full_national_number based on the date. Can someone please suggest a solution?
My data is:
+--------------------+-----------------------+----------+
|full_national_number|derived_sequence_number|        ts|
+--------------------+-----------------------+----------+
|           A00000001|                   0000|1111-11-11|
|           A00000001|                   0001|1111-11-11|
|           A00000001|                   0002|1111-11-11|
|           A00000002|                   0000|1111-11-11|
|           A00000002|                   0001|1111-11-11|
|           A00000002|                   0002|1111-11-11|
|           A00000003|                   0000|1111-11-11|
|           A00000003|                   0001|1111-11-11|
|           A00000004|                   0000|1111-11-11|
|          A000000010|                   0000|1111-11-11|
|          A000000011|                   0000|1111-11-11|
|           A00000008|                   0000|2018-11-16|
|           A00000008|                   0001|2018-11-16|
|           A00000008|                   0002|2018-11-16|
|           A00000002|                   0000|2018-11-16|
|           A00000003|                   0000|2018-11-16|
|           A00000004|                   0000|2018-11-16|
|           A00000005|                   0000|2018-11-16|
+--------------------+-----------------------+----------+
My expected output should be:
+--------------------+-----------------------+----------+
|full_national_number|derived_sequence_number|        ts|
+--------------------+-----------------------+----------+
|           A00000001|                   0000|1111-11-11|
|           A00000001|                   0001|1111-11-11|
|           A00000001|                   0002|1111-11-11|
|           A00000002|                   0000|2018-11-16|
|           A00000003|                   0000|2018-11-16|
|           A00000004|                   0000|2018-11-16|
|           A00000005|                   0000|2018-11-16|
|           A00000008|                   0000|2018-11-16|
|           A00000008|                   0001|2018-11-16|
|           A00000008|                   0002|2018-11-16|
|          A000000010|                   0000|1111-11-11|
|          A000000011|                   0000|1111-11-11|
+--------------------+-----------------------+----------+
I have tried the query below but got an error.

sqlContext.sql("""
    SELECT full_national_number, derived_sequence_number, max(ts)
    FROM (
        SELECT *, to_date('1111-11-11') AS ts FROM t1
        UNION ALL
        SELECT *, current_date FROM t2
    ) unioned
    GROUP BY full_national_number
""").show()
The error I get is:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/cloudera/parcels/CDH-5.14.4-1.cdh5.14.4.p0.3/lib/spark/python/pyspark/sql/context.py", line 580, in sql
    return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
  File "/opt/cloudera/parcels/CDH-5.14.4-1.cdh5.14.4.p0.3/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/cloudera/parcels/CDH-5.14.4-1.cdh5.14.4.p0.3/lib/spark/python/pyspark/sql/utils.py", line 51, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"expression 'derived_sequence_number' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;"
Please suggest a solution for this.
python sql pyspark pyspark-sql
edited Nov 16 '18 at 5:30 by Aqueous Carlos
asked Nov 16 '18 at 3:55 by Abhishek Allamsetty
1 Answer
I think this will get you the results you want. Here is the SQL query:

SELECT full_national_number, derived_sequence_number, ts
FROM (
    SELECT full_national_number, derived_sequence_number, ts,
           RANK() OVER (PARTITION BY full_national_number ORDER BY ts DESC) AS rnk
    FROM table
) a
WHERE a.rnk = 1;

Let me know if this helps.
Awesome Mayank, it worked just as expected. Thank you so much; it really helped me. – Abhishek Allamsetty Nov 16 '18 at 5:14
@AbhishekAllamsetty Since the answer helped you, please accept it. Read the link if you have issues, or let me know. – Mayank Porwal Nov 16 '18 at 5:49
Sorry Mayank, I did it now. And thank you so much for your lightning response, which made my day. :) – Abhishek Allamsetty Nov 16 '18 at 15:02
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53331203%2fneed-to-find-the-latest-records-for-a-composite-keys-in-spark-sql%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
answered Nov 16 '18 at 4:08 by Mayank Porwal