Pandas boolean operations are inconsistent with one comparison vs. many comparisons

I am trying to filter out rows in my DataFrame (more than 400,000 rows) where the values in one column are of type None. The goal is to leave the DataFrame with only rows whose values in the 'Column' column are floats. I plan to do this by passing in a boolean array, except that I can't construct the boolean array properly: every element comes back True.



When I run the following comparison on a single value, for any index i within the DataFrame's range, it works:



df.loc[i, 'Column'] != None 


Rows that have a value of None in 'Column' give the result False.



But when I run the same comparison on the whole column:



df.loc[0:len(df), 'Column'] != None 


The boolean array comes back as all True.



Why is this? Is this a pandas bug? An edge case? Intended behaviour for reasons I don't understand?



I can think of other ways to construct my boolean array, though this seems the most efficient. But it bothers me that this is the result I am getting.

python pandas boolean-operations

asked Nov 12 '18 at 17:16
– David
  • I think the simplest solution is df.Column.notnull(), as None is a null value recognized by pandas. Though I'm unsure why the element-wise comparison with None fails here. Using .values works: df.Column.values != None
    – ALollz, Nov 12 '18 at 17:23

  • Can't you make use of dropna()?
    – hootnot, Nov 12 '18 at 17:39

  • @hootnot That is what I am using now :)
    – David, Nov 26 '18 at 2:17
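
As ALollz's first comment observes, dropping to the underlying NumPy array sidesteps pandas' special handling of null comparisons. A minimal sketch of the difference (an object-dtype series is assumed here so the None values survive construction, and the exact behaviour can vary by pandas version):

import pandas as pd

s = pd.Series([1.0, None, 3.0], dtype=object)  # dtype=object keeps None un-coerced

print((s != None).tolist())         # [True, True, True]  -- pandas treats None as a null scalar
print((s.values != None).tolist())  # [True, False, True] -- plain element-wise comparison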

1 Answer

Here's a reproducible example of what you're seeing:



import pandas as pd

x = pd.Series([1, None, 3, None, None])

print(x != None)

0    True
1    True
2    True
3    True
4    True
dtype: bool


What's not obvious is that, behind the scenes, pandas converts your series to a numeric dtype and converts those None values to np.nan:



print(x)

0    1.0
1    NaN
2    3.0
3    NaN
4    NaN
dtype: float64


The NumPy array underlying the series can then be held in a contiguous memory block and supports vectorised operations. Since np.nan != np.nan by design, your Boolean series will contain only True values, even if you test against np.nan instead of None.
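
The same all-True result can be reproduced at the NumPy level, independent of pandas (a minimal sketch):

import numpy as np

# IEEE 754: NaN compares unequal to everything, including itself.
print(np.nan != np.nan)   # True

arr = np.array([1.0, np.nan, 3.0])
print(arr != np.nan)      # [ True  True  True] -- every element, NaN included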



For efficiency and correctness, you should use pd.to_numeric with isnull / notnull for checking null values:



print(pd.to_numeric(x, errors='coerce').notnull())

0     True
1    False
2     True
3    False
4    False
dtype: bool
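
Applied to the original problem, the resulting mask filters the DataFrame directly. A hedged sketch ('Column' stands in for the asker's actual column name; df.dropna(subset=['Column']), as suggested in the comments, is equivalent once the column is numeric):

import pandas as pd

df = pd.DataFrame({'Column': [1.5, None, 2.0, None]})

# Keep only rows where 'Column' parses to a number.
mask = pd.to_numeric(df['Column'], errors='coerce').notnull()
filtered = df[mask]

# Equivalent route via dropna, per the comments:
filtered_alt = df.dropna(subset=['Column'])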





answered Nov 12 '18 at 17:26 (edited Nov 13 '18 at 0:49)
– jpp
  • Though, even if the Series isn't initially converted (x = pd.Series(['1', None, 'hello', None, None])), there must still be some conversion happening during the comparison?
    – ALollz, Nov 12 '18 at 17:34

  • @ALollz, Yup, seems so. I haven't dug into the source for pd.Series.__eq__; my instinct is there's custom logic and edge cases (which also explains why using the NumPy array for comparisons is faster). Best to avoid all this and use pd.to_numeric.
    – jpp, Nov 12 '18 at 17:36

  • @jpp Thank you very much! This is working well for me :) Regarding pd.to_numeric, why do you use 'coerce' rather than 'raise'? Is it because you expect and are okay with a few bad values in your rows (but don't want your analysis/application to throw an exception)? If correctness is your main concern, wouldn't you want 'raise', since NaN rows will affect your analysis without making you aware of it?
    – David, Nov 19 '18 at 22:38

  • @David, It's up to you. I make the assumption that your data is, or should be, clean. If it's not, you can use 'raise'.
    – jpp, Nov 19 '18 at 23:58
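
On the 'coerce' vs. 'raise' point from the comments, a minimal sketch of the trade-off (the exact error message varies by pandas version):

import pandas as pd

s = pd.Series(['1', None, 'hello'])

# 'coerce' silently turns unparseable values into NaN (None is treated as missing either way).
print(pd.to_numeric(s, errors='coerce'))   # 1.0, NaN, NaN

# 'raise' (the default) surfaces bad values instead of hiding them.
try:
    pd.to_numeric(s, errors='raise')
except ValueError as exc:
    print('ValueError:', exc)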