How to check if pandas DataFrame groups have the same data


I have a pandas dataframe as below

    id  name  Base  field1  field2          field3
    1   AA    Y     Yes     Consumer        Not Applicable
    1   BB    N     Yes     Consumer        Not Applicable
    2   CC    Y     Yes     Consumer        Not Applicable
    2   DD    N     Yes     Not Applicable  Not Applicable
    2   EE    N     No      Not Applicable  Modified
    3   FF    Y     Yes     Not Applicable  Applicable
    3   GG    N     Yes     Not Applicable  Not Applicable
    3   HH    N     Yes     Not Applicable  Not Applicable


I want to group this dataframe by the id column, check whether every other column holds the same value across all rows of each group, and finally write out the results.

I tried the following to validate the data in each group, but it always returns True:



    import pandas as pd

    result_list = []
    for col in df.columns:
        # True for a group when the column holds a single distinct value
        result = df.groupby(level=0)[col].apply(lambda x: len(set(x)) == 1)
        result_list.append(result)

    final = pd.concat(result_list, axis=1)


The expected result is

    id  name  field1  field2          field3          Error
    1   AA    Yes     Consumer        Not Applicable  Pass
    1   BB    Yes     Consumer        Not Applicable  Pass
    2   CC    Yes     Consumer        Not Applicable  field1, field2, field3 mismatch for ID: 2
    2   DD    Yes     Not Applicable  Not Applicable  field1, field2, field3 mismatch for ID: 2
    2   EE    No      Not Applicable  Modified        field1, field2, field3 mismatch for ID: 2
    3   FF    Yes     Not Applicable  Applicable      field3 mismatch for ID: 3
    3   GG    Yes     Not Applicable  Not Applicable  field3 mismatch for ID: 3
    3   HH    Yes     Not Applicable  Not Applicable  field3 mismatch for ID: 3


Any help on this?

python-3.x pandas dataframe pandas-groupby

asked Nov 13 '18 at 12:52 by Osceria
edited Nov 14 '18 at 10:23 by Akhilesh Pandey

  • What's your desired result? Should only id = 1 pass your test?
    – jpp
    Nov 13 '18 at 13:38

  • Hi, I've updated the dataframe and expected result. Let me know if it helps.
    – Osceria
    Nov 13 '18 at 14:02


2 Answers

You may get what you want with the following code (assuming that df has an index named id):



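    def handler(df):
        # Report the first column that holds more than one distinct value.
        for col in ['field1', 'field2', 'field3']:
            if df.loc[:, col].nunique() > 1:
                return 'error in {} for id {}'.format(col, df.index[0])
        else:
            # for-else: reached only when no column triggered a return
            return 'pass'

    # One message per id, then merged back onto the rows
    result = df.groupby(level=0).apply(handler)
    result = df.reset_index().merge(result.to_frame().reset_index(), on='id')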


result is:

       id  name  field1  field2          field3          0
    0  1   AA    Yes     Consumer        Not Applicable  pass
    1  1   BB    Yes     Consumer        Not Applicable  pass
    2  2   CC    Yes     Consumer        Not Applicable  error in field1 for id 2
    3  2   DD    Yes     Not Applicable  Not Applicable  error in field1 for id 2
    4  2   EE    No      Not Applicable  Modified        error in field1 for id 2
    5  3   FF    Yes     Not Applicable  Applicable      error in field3 for id 3
    6  3   GG    Yes     Not Applicable  Not Applicable  error in field3 for id 3
    7  3   HH    Yes     Not Applicable  Not Applicable  error in field3 for id 3


EDIT: a small change to handler so that it reports every mismatching column:



    def handler(df):
        cols = list()
        # Collect every column that holds more than one distinct value.
        for col in ['field1', 'field2', 'field3']:
            if df.loc[:, col].nunique() > 1:
                cols.append(col)
        if cols:
            return 'error in {} for id {}'.format(', '.join(cols), df.index[0])
        else:
            return 'pass'
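To try the handler approach end to end, here is a minimal, self-contained sketch. It rebuilds the sample data from the question (without the Base column), sets id as the index as the answer assumes, and derives the columns to check instead of hard-coding them. The names check_cols and messages, the rule "validate everything except name", and the message wording borrowed from the question's expected output are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question (the Base column is left out here).
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2, 3, 3, 3],
    "name": ["AA", "BB", "CC", "DD", "EE", "FF", "GG", "HH"],
    "field1": ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes"],
    "field2": ["Consumer", "Consumer", "Consumer", "Not Applicable",
               "Not Applicable", "Not Applicable", "Not Applicable",
               "Not Applicable"],
    "field3": ["Not Applicable", "Not Applicable", "Not Applicable",
               "Not Applicable", "Modified", "Applicable",
               "Not Applicable", "Not Applicable"],
}).set_index("id")

# Assumption: every column except 'name' should be validated.
check_cols = [c for c in df.columns if c != "name"]


def handler(group):
    """Return 'Pass', or a message naming every column that varies in the group."""
    bad = [c for c in check_cols if group[c].nunique() > 1]
    if bad:
        return "{} mismatch for ID: {}".format(", ".join(bad), group.index[0])
    return "Pass"


# One message per id, then mapped back onto the rows through the shared index.
messages = df.groupby(level=0)[check_cols].apply(handler)
result = df.assign(Error=df.index.map(messages)).reset_index()
print(result)
```

For the sample data, the rows with id 1 get Pass, the rows with id 2 get "field1, field2, field3 mismatch for ID: 2", and the rows with id 3 get "field3 mismatch for ID: 3".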






answered Nov 13 '18 at 14:52 by Poolka
edited Nov 14 '18 at 6:19

  • Hi Poolka, this code almost produces the expected results. However, the error column does not show when the data mismatches on more than one field. For id 2, it should say: Field1, Field2 & Field3 mismatch for ID: 2. Any thoughts?
    – Osceria
    Nov 13 '18 at 16:28

  • The data comparison always happens on the first field of the list, and the other fields are skipped.
    – Osceria
    Nov 14 '18 at 0:52

  • @Osceria The code in the answer is a working basis that does something pretty close to what you want. Feel free to modify it (column names, handler, and so on) to meet your expectations. As for the issue in your comment, check the EDIT addition.
    – Poolka
    Nov 14 '18 at 6:25

  • It works perfectly for my requirements after slight changes. I just posted another question, similar to this one but with an extra check: stackoverflow.com/questions/53295685/…
    – Osceria
    Nov 14 '18 at 8:39


You could group by id and then aggregate each column, counting the number of unique values per group. There is a mismatch wherever that count is greater than 1:



    # True where a column has more than one distinct value within an id group
    df[df.columns.drop('name')].groupby('id').agg(lambda x: len(x.unique())) > 1


This gives the following output, from which you can construct your error string:



        field1  field2  field3
    id
    1   False   False   False
    2   True    True    True
    3   False   False   True
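The last step, turning the boolean table into the error string, could look like the sketch below. It is only one way to do it: it uses groupby(...).nunique() in place of the lambda (note that nunique ignores NaN, whereas len(x.unique()) counts it as a value), and the names check_cols, describe and messages, as well as the message wording, are assumptions taken from the question's expected output rather than from this answer.

```python
import pandas as pd

# Sample data copied from the question (the Base column is left out here).
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2, 3, 3, 3],
    "name": ["AA", "BB", "CC", "DD", "EE", "FF", "GG", "HH"],
    "field1": ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes"],
    "field2": ["Consumer", "Consumer", "Consumer", "Not Applicable",
               "Not Applicable", "Not Applicable", "Not Applicable",
               "Not Applicable"],
    "field3": ["Not Applicable", "Not Applicable", "Not Applicable",
               "Not Applicable", "Modified", "Applicable",
               "Not Applicable", "Not Applicable"],
})

# Assumption: 'id' is the grouping key and 'name' is excluded from validation.
check_cols = [c for c in df.columns if c not in ("id", "name")]

# True where a column holds more than one distinct value within an id group.
mismatch = df.groupby("id")[check_cols].nunique() > 1


def describe(row):
    """Build the message for one id from its row of boolean flags."""
    bad = row[row].index  # names of the columns flagged True
    if len(bad) == 0:
        return "Pass"
    return "{} mismatch for ID: {}".format(", ".join(bad), row.name)


# One message per id, then mapped back onto every row of that id.
messages = mismatch.apply(describe, axis=1)
df["Error"] = df["id"].map(messages)
print(df)
```

Because the columns to check are derived from df.columns, the same code can be reused on dataframes with different field names, as long as the id and name columns are present.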






answered Nov 13 '18 at 14:53 by Franco Piccolo
edited Nov 13 '18 at 17:52

  • This helps. What if the column names to be validated differ between iterations? I run this code in a for loop over different dataframes (df1, df2), and the columns of df, df2 and df3 are different, so I don't want to hardcode field names that keep changing from one dataframe to the next.
    – Osceria
    Nov 13 '18 at 15:54

  • See the edit: you can pass the column list with the 'name' column dropped, and then it works for any number of fields.
    – Franco Piccolo
    Nov 13 '18 at 17:52

  • Okay. Suppose I add another column (Base) to the dataframe (edited). In every group based on ID, exactly one row has Base 'Y' and the other rows have 'N'. The values of the row where Base = 'Y' should be the reference, and the rows with Base 'N' should be validated against it. The columns that differ on each row should be noted in an error column. Any thoughts?
    – Osceria
    Nov 14 '18 at 0:49

  • That completely changes the scope of the question and the solution. I would suggest writing another question with a different input and output for clarification.
    – Franco Piccolo
    Nov 14 '18 at 6:53

  • Okay. Posted here: stackoverflow.com/questions/53295685/…
    – Osceria
    Nov 14 '18 at 8:40

  • -This helps. What if the column names to be validated differ in different iterations. I run this piece in a for loop which iterates with different data frames(df1,df2) and the columns of df,df2 and df3 are different. So, I don't want to hardcode the names of the fields which keep changing for other dataframes

    – Osceria
    Nov 13 '18 at 15:54
























































































































