Pandas boolean operations are inconsistent with one comparison vs. many comparisons
I am trying to filter out rows in my dataframe (> 400000 rows) where values in one column are of the None type. The goal is to leave the dataframe with only rows whose values in the 'Column' column are floats. I plan to do this by passing in an array of booleans, except that I can't construct the boolean array properly (it comes back all True).
When I run the following operation, for a value of i within the df range, the comparison works:

df.loc[i, 'Column'] != None

Rows that have a value of None in 'Column' give the result False.
But when I run this operation:

df.loc[0:len(df), 'Column'] != None

the boolean array comes back as all True.
Why is this? Is it a pandas bug? An edge case? Intended behaviour for reasons I don't understand?
I can think of other ways to construct my boolean array, though this seems the most efficient. But it bothers me that this is the result I'm getting.
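For concreteness, here is roughly what I'm attempting, with a toy frame standing in for my real data (the real df has > 400000 rows):

import pandas as pd

df = pd.DataFrame({'Column': [1.5, None, 2.7, None]})

# Expected mask: [True, False, True, False]
mask = df.loc[0:len(df), 'Column'] != None
print(mask.tolist())    # actual: [True, True, True, True]

df_filtered = df[mask]  # so nothing gets filtered out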
python pandas boolean-operations

asked Nov 12 '18 at 17:16 – David
I think the simplest solution is df.Column.notnull(), as None is a null value recognized by pandas. Though I'm unsure as to why the element-wise comparison with None fails here. Using .values works: df.Column.values != None – ALollz Nov 12 '18 at 17:23

You can't make use of dropna()? – hootnot Nov 12 '18 at 17:39

@hootnot That is what I am using now :) – David Nov 26 '18 at 2:17
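For reference, a minimal sketch of the suggestions above, assuming the column is object dtype and actually holds None (which matches the scalar comparison behaviour described in the question):

import pandas as pd

df = pd.DataFrame({'Column': [1.5, None, 2.7, None]}, dtype=object)

# notnull(): pandas recognises None as a null value
print(df['Column'].notnull().tolist())         # [True, False, True, False]

# comparing against the raw NumPy object array also works element-wise
print((df['Column'].values != None).tolist())  # [True, False, True, False]

# dropna(): drop the rows whose 'Column' value is null
print(df.dropna(subset=['Column']))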
1 Answer
Here's a reproducible example of what you're seeing:
x = pd.Series([1, None, 3, None, None])
print(x != None)
0 True
1 True
2 True
3 True
4 True
dtype: bool
What's not obvious is that, behind the scenes, Pandas converts your series to numeric and converts those None values to np.nan:
print(x)
0 1.0
1 NaN
2 3.0
3 NaN
4 NaN
dtype: float64
The NumPy array underlying the series can then be held in a contiguous memory block and support vectorised operations. Since np.nan != np.nan by design, your Boolean series will contain only True values, even if you were to test against np.nan instead of None.
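A quick illustration of that comparison rule (standard IEEE 754 NaN behaviour):

import numpy as np

print(np.nan == np.nan)  # False: NaN compares unequal to everything
print(np.nan != np.nan)  # True, including itself
print(np.nan != None)    # True, so the mask ends up all True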
For efficiency and correctness, you should use pd.to_numeric with isnull / notnull for checking null values:
print(pd.to_numeric(x, errors='coerce').notnull())
0 True
1 False
2 True
3 False
4 False
dtype: bool

answered Nov 12 '18 at 17:26, edited Nov 13 '18 at 0:49 – jpp
Though, even if the Series isn't initially converted, x = pd.Series(['1', None, 'hello', None, None]), there must still be some conversion happening during the comparison? – ALollz Nov 12 '18 at 17:34

@ALollz, Yup, seems so. I haven't dug into the source for pd.Series.__eq__; my instinct is there's custom logic & edge cases [which also explains why using the NumPy array for comparisons is faster]. Best to avoid all this and use pd.to_numeric. – jpp Nov 12 '18 at 17:36

@jpp Thank you very much! This is working well for me :) Regarding pd.to_numeric, why do you use 'coerce' rather than 'raise'? Is it because you expect, and are okay with, a few bad values in your rows (but don't want your analysis/application to throw an exception)? If correctness is your main concern, wouldn't you want 'raise', since NaN rows will affect your analysis without making you aware of it? – David Nov 19 '18 at 22:38

@David, It's up to you. I make the assumption your data is, or should be, clean. If it's not, you can use 'raise'. – jpp Nov 19 '18 at 23:58
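For completeness, a sketch contrasting the two error modes discussed above, applied to a toy frame (df and 'Column' stand in for the question's real data):

import pandas as pd

df = pd.DataFrame({'Column': [1.5, None, 2.7, None]}, dtype=object)

# errors='coerce': non-numeric values become NaN, then filter them out
numeric = pd.to_numeric(df['Column'], errors='coerce')
print(df[numeric.notnull()])  # rows 0 and 2 survive

# errors='raise': fail loudly instead of silently coercing bad values
# (note that genuine missing values like None still pass through as NaN)
try:
    pd.to_numeric(pd.Series(['1', 'hello']), errors='raise')
except ValueError as exc:
    print(exc)  # Unable to parse string "hello" at position 1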