Pandas dataframe select rows where a list-column contains any of a list of strings
I've got a pandas DataFrame that looks like this:
molecule species
0 a [dog]
1 b [horse, pig]
2 c [cat, dog]
3 d [cat, horse, pig]
4 e [chicken, pig]
and I'd like to extract a DataFrame containing only those rows whose species list contains any of selection = ['cat', 'dog']. So the result should look like this:
molecule species
0 a [dog]
1 c [cat, dog]
2 d [cat, horse, pig]
What would be the simplest way to do this?
For testing:
import pandas as pd
selection = ['cat', 'dog']
df = pd.DataFrame({'molecule': ['a','b','c','d','e'], 'species' : [['dog'], ['horse','pig'],['cat', 'dog'], ['cat','horse','pig'], ['chicken','pig']]})
python pandas dataframe
asked Nov 16 '18 at 17:29
NicoH
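Use (joining each list into a string first, because .str.contains returns NaN for list elements)
df = df.loc[df.species.str.join(' ').str.contains('cat|dog'), :]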
– Sandeep Kadapa
Nov 16 '18 at 17:44
6 Answers
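IIUC, you can re-create a DataFrame from the list column and then use isin with any; this should be faster than apply. Passing index=df.index keeps the mask aligned with df, and axis=1 is given by keyword because newer pandas releases no longer accept it positionally.
df[pd.DataFrame(df.species.tolist(), index=df.index).isin(selection).any(axis=1)]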
Out[64]:
molecule species
0 a [dog]
2 c [cat, dog]
3 d [cat, horse, pig]
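A related way to avoid apply, sketched here on the assumption that pandas 0.25 or later is available (for Series.explode): flatten the list column to one row per species, test membership with isin, and collapse back to one boolean per original row. This is an alternative to the approach above, not the same method; the names exploded, mask and result are only illustrative.

```python
import pandas as pd

selection = ['cat', 'dog']
df = pd.DataFrame({
    'molecule': ['a', 'b', 'c', 'd', 'e'],
    'species': [['dog'], ['horse', 'pig'], ['cat', 'dog'],
                ['cat', 'horse', 'pig'], ['chicken', 'pig']],
})

# One row per (original index label, species) pair; labels are repeated.
exploded = df['species'].explode()

# Membership test per element, then reduce to one boolean per original label.
mask = exploded.isin(selection).groupby(level=0).any()

result = df[mask]
print(result)
```

Because the grouping is done on the index level, the mask carries the same labels as df, so this also works when df does not have a default RangeIndex, provided the labels are unique.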
answered Nov 16 '18 at 17:56
Wen-Ben
Using NumPy can be much faster than using pandas in this case (this needs import numpy as np). The timing reported for each option follows its code.
Option 1: using NumPy's intersect1d:
mask = df.species.apply(lambda x: np.intersect1d(x, selection).size > 0)
df[mask]
450 µs ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
molecule species
0 a [dog]
2 c [cat, dog]
3 d [cat, horse, pig]
Option 2: a similar solution using np.isin, the recommended replacement for the older np.in1d:
df[df.species.apply(lambda x: np.any(np.isin(x, selection)))]
420 µs ± 17.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Option 3: interestingly, a pure Python set intersection is the fastest of the three here:
df[df.species.apply(lambda x: bool(set(x) & set(selection)))]
305 µs ± 5.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
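A small variation on Option 3, offered as a sketch and not benchmarked here: build the selection set once outside the lambda and use set.isdisjoint, which accepts any iterable and therefore does not need a new set for every row. The name selection_set is an assumption for illustration.

```python
import pandas as pd

selection = ['cat', 'dog']
df = pd.DataFrame({
    'molecule': ['a', 'b', 'c', 'd', 'e'],
    'species': [['dog'], ['horse', 'pig'], ['cat', 'dog'],
                ['cat', 'horse', 'pig'], ['chicken', 'pig']],
})

# Built once, reused for every row.
selection_set = set(selection)

# isdisjoint returns True when nothing is shared, so negate it.
mask = df['species'].apply(lambda x: not selection_set.isdisjoint(x))

result = df[mask]
print(result)
```

Whether this is measurably faster depends on the data, so it is worth timing on a realistic frame before preferring it.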
edited Nov 16 '18 at 18:04
answered Nov 16 '18 at 17:53
Vaishali
You can use a boolean mask with apply here.
selection = ['cat', 'dog']
mask = df.species.apply(lambda x: any(item for item in selection if item in x))
df1 = df[mask]
For the DataFrame you've provided as an example above, df1 will be:
molecule species
0 a [dog]
2 c [cat, dog]
3 d [cat, horse, pig]
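If some rows might hold a missing value instead of a list, the lambda above would raise a TypeError on them. A minimal sketch of a more defensive mask, assuming missing entries should simply count as "no match"; the helper name contains_any and the sample frame with a None entry are hypothetical:

```python
import pandas as pd

def contains_any(values, wanted):
    """Return True if `values` is a list-like sharing an element with `wanted`."""
    if not isinstance(values, (list, tuple, set)):
        return False  # e.g. None or NaN for a missing entry
    return any(item in values for item in wanted)

selection = ['cat', 'dog']
# Hypothetical data: the second row has no species list at all.
df = pd.DataFrame({
    'molecule': ['a', 'b', 'c'],
    'species': [['dog'], None, ['horse']],
})

mask = df['species'].apply(lambda x: contains_any(x, selection))
result = df[mask]
print(result)
```

The generator inside any stops at the first hit, so rows that match early do not scan the whole selection.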
edited Nov 16 '18 at 17:42
answered Nov 16 '18 at 17:34
Wes Doyle
Given that @NicoH is looking for the presence of 'cat' or 'dog', I would recommend changing the mask to this:
mask = df.species.apply(lambda x: any(item for item in selection if item in x))
– rs311
Nov 16 '18 at 17:40
@rs311 agreed - updated the lambda with selection example
– Wes Doyle
Nov 16 '18 at 17:43
This is an easy and basic approach.
You can create a function that checks whether any element of the selection list is present in a row's species list.
def check(speciesList):
    flag = False
    for animal in selection:
        if animal in speciesList:
            flag = True
    return flag
You could then use this function to create a column that contains True or False depending on whether the record contains at least one element of the selection list, and create a new DataFrame based on it.
df['containsCatDog'] = df.species.apply(check)
newDf = df[df.containsCatDog]
Hope it helps.
answered Nov 16 '18 at 17:55
Command
import pandas as pd
selection = ['cat', 'dog']
df = pd.DataFrame({'molecule': ['a','b','c','d','e'], 'species' : [['dog'], ['horse','pig'],['cat', 'dog'], ['cat','horse','pig'], ['chicken','pig']]})
# Select the rows containing each species separately
df1 = df[df['species'].apply(lambda x: 'dog' in x)]
df2 = df[df['species'].apply(lambda x: 'cat' in x)]
# Stack both selections; rows containing both species appear twice
frames = [df1, df2]
result = pd.concat(frames, join='inner', ignore_index=False)
# Keep only the first occurrence of each original index label
result = result[~result.index.duplicated(keep='first')]
print(result)
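The code above hard-codes 'dog' and 'cat'. A hedged sketch of how the same idea could be generalized to any non-empty selection list: build one boolean mask per item and combine them with a logical OR, which also avoids the concat and de-duplication step. The names masks and combined are illustrative, not part of the answer above.

```python
from functools import reduce
import operator

import pandas as pd

selection = ['cat', 'dog']
df = pd.DataFrame({
    'molecule': ['a', 'b', 'c', 'd', 'e'],
    'species': [['dog'], ['horse', 'pig'], ['cat', 'dog'],
                ['cat', 'horse', 'pig'], ['chicken', 'pig']],
})

# One boolean Series per selected item; item=item binds the loop variable.
masks = [df['species'].apply(lambda x, item=item: item in x)
         for item in selection]

# Element-wise OR across all masks (selection must not be empty here).
combined = reduce(operator.or_, masks)

result = df[combined]
print(result)
```

Since a single mask is applied to df, each matching row appears once and in its original order.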
answered Nov 16 '18 at 20:03
ALEN M A
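Using pandas str.contains (which uses regular expressions). Because species holds lists and not strings, join each list into a single string first. Note that this matches substrings, so 'cat' would also match a name such as 'catfish':
df[df["species"].str.join(" ").str.contains("cat|dog", regex=True)]
Output:
molecule species
0 a [dog]
2 c [cat, dog]
3 d [cat, horse, pig]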
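A hedged sketch of a stricter regular-expression match for a list column: join each list into one string, escape every selection term with re.escape, and wrap the alternation in word boundaries so that 'cat' does not match 'catfish'. The sample frame with 'catfish' and the names pattern and joined are assumptions made for illustration.

```python
import re

import pandas as pd

selection = ['cat', 'dog']
# Hypothetical data chosen to show the substring pitfall.
df = pd.DataFrame({
    'molecule': ['a', 'b', 'c'],
    'species': [['dog'], ['catfish', 'pig'], ['cat', 'horse']],
})

# Non-capturing group avoids pandas' warning about match groups.
pattern = r'\b(?:' + '|'.join(re.escape(s) for s in selection) + r')\b'

# Series.str.join concatenates the strings inside each list element.
joined = df['species'].str.join(' ')

mask = joined.str.contains(pattern, regex=True)
result = df[mask]
print(result)
```

Word boundaries still treat a multi-word name such as 'sea cat' as a match for 'cat', so an exact-membership test on the lists is safer when names can contain spaces.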
answered Nov 16 '18 at 19:30
Ken Dekalb