Pandas boolean operations are inconsistent with one comparison vs. many comparisons
I am trying to filter out rows in my dataframe (> 400000 rows) where values in one column are of the None type. The goal is to leave the dataframe with only rows whose values in the 'Column' column are floats. I plan to do this by passing in an array of booleans, except that I can't construct the boolean array properly (it comes back all True).
When I run the following operation, for a value of i within the df range, the comparison works:

df.loc[i, 'Column'] != None

Rows that have a value of None in 'Column' give the result False.
But when I run this operation:

df.loc[0:len(df), 'Column'] != None

the boolean array comes back as all True.
Why is this? Is it a pandas bug? An edge case? Intended behaviour for reasons I don't understand?
I can think of other ways to construct my boolean array, though this seems the most efficient. But it bothers me that this is the result I'm getting.
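For concreteness, here is roughly what I'm attempting, with a toy frame standing in for my real data (the real df has > 400000 rows):

import pandas as pd

df = pd.DataFrame({'Column': [1.5, None, 2.7, None]})

# Expected mask: [True, False, True, False]
mask = df.loc[0:len(df), 'Column'] != None
print(mask.tolist())    # actual: [True, True, True, True]

df_filtered = df[mask]  # so nothing gets filtered out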
python pandas boolean-operations

asked Nov 12 '18 at 17:16 – David
I think the simplest solution is df.Column.notnull(), as None is a null value recognized by pandas. Though I'm unsure as to why the element-wise comparison with None fails here. Using .values works: df.Column.values != None – ALollz Nov 12 '18 at 17:23

You can't make use of dropna()? – hootnot Nov 12 '18 at 17:39

@hootnot That is what I am using now :) – David Nov 26 '18 at 2:17
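For reference, a minimal sketch of the suggestions above, assuming the column is object dtype and actually holds None (which matches the scalar comparison behaviour described in the question):

import pandas as pd

df = pd.DataFrame({'Column': [1.5, None, 2.7, None]}, dtype=object)

# notnull(): pandas recognises None as a null value
print(df['Column'].notnull().tolist())         # [True, False, True, False]

# comparing against the raw NumPy object array also works element-wise
print((df['Column'].values != None).tolist())  # [True, False, True, False]

# dropna(): drop the rows whose 'Column' value is null
print(df.dropna(subset=['Column']))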
1 Answer
Here's a reproducible example of what you're seeing:
x = pd.Series([1, None, 3, None, None])
print(x != None)
0 True
1 True
2 True
3 True
4 True
dtype: bool
What's not obvious is that, behind the scenes, Pandas converts your series to numeric and converts those None values to np.nan:
print(x)
0 1.0
1 NaN
2 3.0
3 NaN
4 NaN
dtype: float64
The NumPy array underlying the series can then be held in a contiguous memory block and support vectorised operations. Since np.nan != np.nan by design, your Boolean series will contain only True values, even if you were to test against np.nan instead of None.
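A quick illustration of that comparison rule (standard IEEE 754 NaN behaviour):

import numpy as np

print(np.nan == np.nan)  # False: NaN compares unequal to everything
print(np.nan != np.nan)  # True, including itself
print(np.nan != None)    # True, so the mask ends up all True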
For efficiency and correctness, you should use pd.to_numeric with isnull / notnull for checking null values:
print(pd.to_numeric(x, errors='coerce').notnull())
0 True
1 False
2 True
3 False
4 False
dtype: bool

answered Nov 12 '18 at 17:26, edited Nov 13 '18 at 0:49 – jpp
Though, even if the Series isn't initially converted, x = pd.Series(['1', None, 'hello', None, None]), there must still be some conversion happening during the comparison? – ALollz Nov 12 '18 at 17:34

@ALollz, Yup, seems so. I haven't dug into the source for pd.Series.__eq__; my instinct is there's custom logic & edge cases [which also explains why using the NumPy array for comparisons is faster]. Best to avoid all this and use pd.to_numeric. – jpp Nov 12 '18 at 17:36

@jpp Thank you very much! This is working well for me :) Regarding pd.to_numeric, why do you use 'coerce' rather than 'raise'? Is it because you expect, and are okay with, a few bad values in your rows (but don't want your analysis/application to throw an exception)? If correctness is your main concern, wouldn't you want 'raise', since NaN rows will affect your analysis without making you aware of it? – David Nov 19 '18 at 22:38

@David, It's up to you. I make the assumption your data is, or should be, clean. If it's not, you can use 'raise'. – jpp Nov 19 '18 at 23:58
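For completeness, a sketch contrasting the two error modes discussed above, applied to a toy frame (df and 'Column' stand in for the question's real data):

import pandas as pd

df = pd.DataFrame({'Column': [1.5, None, 2.7, None]}, dtype=object)

# errors='coerce': non-numeric values become NaN, then filter them out
numeric = pd.to_numeric(df['Column'], errors='coerce')
print(df[numeric.notnull()])  # rows 0 and 2 survive

# errors='raise': fail loudly instead of silently coercing bad values
# (note that genuine missing values like None still pass through as NaN)
try:
    pd.to_numeric(pd.Series(['1', 'hello']), errors='raise')
except ValueError as exc:
    print(exc)  # Unable to parse string "hello" at position 1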