Complex pandas subsetting: selecting rows that match criteria in many columns



























I'm selecting data from a pandas dataframe which is ~1.5M rows by 22 columns. Each column is a sample and each row is an observation about a mutation. 1.0 means the sample has a mutation, 0.0 means the sample does not have the specific mutation and 0.5 means there is no data on that sample for that mutation.



The samples are from one of 3 tissue types; let's call the tissues AE, BE and HE. The samples fit into the categories as follows:



AE=["X14AE","X15AE","X22AE","X23AE","AE21.35","AE36.45","AE46.55","AE61.80",]
BE=["X161724BE","BE1.2","BE1.8","BE2","BE9.13"]
HE=["X11HE","X18HE","HE17.24","HE2.4.5.6","HE8.15","HE8.9"]


I've written the following queries with pandas; they all work but seem very clunky:



-Get variants in AE and in HE but not in BE
-Get variants in AE and in BE but not in HE
-Get variants in BE and in HE but not in AE


Here is the code:

"""Get variants in AE and in HE but not in BE"""
AE_HE_notBE = df.loc[((df["X14AE"] == 1.0) | (df["X15AE"] == 1.0) | (df["X22AE"] == 1.0) |
(df["X23AE"] == 1.0) | (df["AE21.35"] == 1.0) | (df["AE36.45"] == 1.0) | (df["AE61.80"] == 1.0)) &
((df["X11HE"] == 1.0) | (df["X18HE"] == 1.0) |(df["HE17.24"] == 1.0) |(df["HE2.4.5.6"] == 1.0) |
(df["HE8.15"] == 1.0) | (df["HE8.9"] == 1.0)) & ((df["X161724BE"] != 1.0) & (df["BE1.2"] != 1.0) &
(df["BE1.8"] != 1.0) & (df["BE2"] != 1.0) & (df["BE9.13"] != 1.0)) & ((df["X161724BE"] != 0.5) | (df["BE1.2"] != 0.5) |
(df["BE1.8"] != 0.5) | (df["BE2"] != 0.5) | (df["BE9.13"] != 0.5))]


"""Get variants in AE and in BE but not in HE"""
AE_BE_notHE = df.loc[((df["X14AE"] == 1.0) | (df["X15AE"] == 1.0) | (df["X22AE"] == 1.0) |
(df["X23AE"] == 1.0) | (df["AE21.35"] == 1.0) | (df["AE36.45"] == 1.0) | (df["AE61.80"] == 1.0)) &
((df["X11HE"] != 1.0) & (df["X18HE"] != 1.0) &(df["HE17.24"] != 1.0) & (df["HE2.4.5.6"] != 1.0) &
(df["HE8.15"] != 1.0) & (df["HE8.9"] != 1.0)) &
((df["X11HE"] != 0.5) | (df["X18HE"] != 0.5) |(df["HE17.24"] != 0.5) |(df["HE2.4.5.6"] != 0.5) |
(df["HE8.15"] != 0.5) | (df["HE8.9"] != 0.5)) &
((df["X161724BE"] == 1.0) | (df["BE1.2"] == 1.0) |
(df["BE1.8"] == 1.0) | (df["BE2"] == 1.0) | (df["BE9.13"] == 1.0))]

"""Get variants in BE and in HE but not in AE"""
BE_HE_notAE = df.loc[((df["X161724BE"] == 1.0) | (df["BE1.2"] == 1.0) |
(df["BE1.8"] == 1.0) | (df["BE2"] == 1.0) | (df["BE9.13"] == 1.0)) &
((df["X11HE"] == 1.0) | (df["X18HE"] == 1.0) |(df["HE17.24"] == 1.0) |(df["HE2.4.5.6"] == 1.0) |
(df["HE8.15"] == 1.0) | (df["HE8.9"] == 1.0)) &
((df["X14AE"] != 1.0) & (df["X15AE"] != 1.0) & (df["X22AE"] != 1.0) &
(df["X23AE"] != 1.0) & (df["AE21.35"] != 1.0) & (df["AE36.45"] != 1.0) & (df["AE61.80"] != 1.0)) &
((df["X14AE"] != 0.5) | (df["X15AE"] != 0.5) | (df["X22AE"] != 0.5) |
(df["X23AE"] != 0.5) | (df["AE21.35"] != 0.5) | (df["AE36.45"] != 0.5) | (df["AE61.80"] != 0.5))]


This works, but it is clunky and inelegant, and if I need to change a few things (such as sample names) it is time-consuming to rewrite. Can anyone help me with a simpler way to write this query? I was wondering whether I could just pass in each list with a criterion. Something like:



AE_HE_notBE = df.loc[((df.[at least 1 sample from AE_list] == 1.0) & (df.[at least 1 sample from HE_list] == 1.0) & (df.[no sample from BE_list] == 1.0) & (df.[at least 1 sample from BE_list] == 0.0))]


I often need to subset rows based on multiple columns, where the columns fall into regular groups, so if anyone can make this kind of query more concise I would really appreciate it. Many thanks.




EDIT: Minimal example as requested:



mutations=[[1,1,0,0,0.5,0],
[1,0,0,0,1,0],
[1,1,0,0.5,0,0],
[0,0.5,0,1,0,1],
[0,1,0,0,0,0],
[1,0,0,0,0,0],
[1,0,1,0,1,0],
[0,0,0,1,0.5,1],
[0,1,1,1,0,0],
[1,0.5,0,1,0,0]]

import string
import pandas as pd

# Row labels 'a'..'j', one per mutation
m_list = list(string.ascii_lowercase[:10])

# Build the frame in one step instead of appending rows in a loop
df = pd.DataFrame(mutations, index=m_list, columns=['AE1','AE2','BE1','BE2','HE1','HE2'])

AE=['AE1','AE2']
BE=['BE1','BE2']
HE=['HE1','HE2']

"""Get variants in AE and in HE but not in BE"""
AE_HE_notBE = df.loc[((df["AE1"] == 1.0) | (df["AE2"] == 1.0)) & ((df["HE1"] == 1.0) | (df["HE2"] == 1.0)) & ((df["BE1"] != 1.0) & (df["BE2"] != 1.0)) & ((df["BE1"] != 0.5) | (df["BE2"] != 0.5))]

"""Get variants in AE and in BE but not in HE"""
AE_BE_notHE = df.loc[((df["AE1"] == 1.0) | (df["AE2"] == 1.0)) & ((df["BE1"] == 1.0) | (df["BE2"] == 1.0)) & ((df["HE1"] != 1.0) & (df["HE2"] != 1.0)) & ((df["HE1"] != 0.5) | (df["HE2"] != 0.5))]

"""Get variants in BE and in HE but not in AE"""
BE_HE_notAE = df.loc[((df["BE1"] == 1.0) | (df["BE2"] == 1.0)) & ((df["HE1"] == 1.0) | (df["HE2"] == 1.0)) & ((df["AE1"] != 1.0) & (df["AE2"] != 1.0)) & ((df["AE1"] != 0.5) | (df["AE2"] != 0.5))]


This is an extremely simplified minimal example of the problem. Multiple criteria are used to select subsets of the df: I would like to apply one type of condition across a whole group of columns and a different type of condition to another group of columns. This gets very messy once you have more than, say, 10 columns. The first example is more realistic and, as has been pointed out, almost unreadable, which is exactly my point. Is there a neater way to write this kind of complex query/subset, where multiple columns require the same selective operation? I would appreciate any help.



































  • I've provided a minimal example that can be copied and pasted into a Python terminal. The code being almost unreadable is exactly my point - I'm hopeful that someone else has come across this problem before and knows a more effective and readable way to write this kind of complex subset.

    – user3062260
    Nov 15 '18 at 13:26













  • Consider checking this link and also query.

    – Sandeep Kadapa
    Nov 15 '18 at 13:45





















































python pandas






asked Nov 15 '18 at 12:21, edited Nov 15 '18 at 13:23
user3062260

























1 Answer















eq + any / all + loc



You can vectorise this: select each group of columns, then combine an equality test with any / all across each row:



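# Get variants in AE and in HE but not in BE

m1 = df[AE].eq(1.0).any(axis=1)  # at least one AE sample has the mutation
m2 = df[HE].eq(1.0).any(axis=1)  # at least one HE sample has the mutation
m3 = df[BE].eq(0).all(axis=1)    # every BE sample is a confirmed 0

df_filtered = df.loc[m1 & m2 & m3]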


If, as you describe, all your values are 0, 0.5 or 1.0, saying selected values cannot be 1.0 or 0.5 is the same as requiring them to be 0.
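
The three masks can also be wrapped in a small helper so that each of the three queries from the question becomes a single call. This is only a sketch under the same assumption (values are 0, 0.5 or 1.0): the name `select_variants` and its `present` / `absent` parameters are hypothetical, not part of pandas, and the data is the minimal example from the question.

```python
import string
import pandas as pd

# Minimal example data from the question (1 = mutation, 0 = no mutation, 0.5 = no data)
mutations = [[1, 1, 0, 0, 0.5, 0],
             [1, 0, 0, 0, 1, 0],
             [1, 1, 0, 0.5, 0, 0],
             [0, 0.5, 0, 1, 0, 1],
             [0, 1, 0, 0, 0, 0],
             [1, 0, 0, 0, 0, 0],
             [1, 0, 1, 0, 1, 0],
             [0, 0, 0, 1, 0.5, 1],
             [0, 1, 1, 1, 0, 0],
             [1, 0.5, 0, 1, 0, 0]]
df = pd.DataFrame(mutations, index=list(string.ascii_lowercase[:10]),
                  columns=['AE1', 'AE2', 'BE1', 'BE2', 'HE1', 'HE2'])

AE = ['AE1', 'AE2']
BE = ['BE1', 'BE2']
HE = ['HE1', 'HE2']


def select_variants(frame, present, absent):
    """Hypothetical helper: keep rows with a 1.0 in every `present` group
    and only 0 values in the `absent` group."""
    mask = frame[absent].eq(0).all(axis=1)
    for group in present:
        mask &= frame[group].eq(1.0).any(axis=1)
    return frame.loc[mask]


AE_HE_notBE = select_variants(df, present=[AE, HE], absent=BE)
AE_BE_notHE = select_variants(df, present=[AE, BE], absent=HE)
BE_HE_notAE = select_variants(df, present=[BE, HE], absent=AE)
```

Renaming a sample then only means editing the `AE`, `BE` or `HE` list; the selection logic itself stays untouched.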
































  • This is a really great and concise piece of code - exactly what I'm looking for! My only question is about the 0.5 values: they are not a problem as long as not all columns in a given group such as 'AE' are 0.5. How can I use your code to build a subset where, for example, in AE at least one col = 1, in HE at least one col = 1, and in BE no col can = 1 but at least one BE col must = 0? However many cols are 0.5 doesn't matter, as long as the right allocation of 1 and 0 is fulfilled.

    – user3062260
    Nov 15 '18 at 14:43











  • For the BE condition you can use something like df[BE].ne(1).all(axis=1) & df[BE].eq(0).any(axis=1); ne stands for "not equal".

    – jpp
    Nov 15 '18 at 14:45








  • This is perfect! Thanks for your help!

    – user3062260
    Nov 16 '18 at 11:47
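
The relaxed condition discussed in these comments (no BE sample equals 1, at least one BE sample equals 0, any number of 0.5 values) could be sketched as follows. Here `ne(1)` is combined with `all`, since "no BE column equals 1" has to hold for every column in the group. The helper name `absent_in` and the three-row frame are made up for illustration and do not come from the question.

```python
import pandas as pd

# Tiny illustrative frame (hypothetical data, same 0 / 0.5 / 1 encoding as the question)
df = pd.DataFrame(
    {'AE1': [1, 1, 1], 'HE1': [1, 1, 1],
     'BE1': [0, 0.5, 1], 'BE2': [0.5, 0.5, 0]},
    index=['x', 'y', 'z'])

AE, HE, BE = ['AE1'], ['HE1'], ['BE1', 'BE2']


def absent_in(frame, cols):
    """Hypothetical helper: no sample is 1 and at least one sample is a confirmed 0."""
    return frame[cols].ne(1).all(axis=1) & frame[cols].eq(0).any(axis=1)


mask = df[AE].eq(1).any(axis=1) & df[HE].eq(1).any(axis=1) & absent_in(df, BE)
result = df.loc[mask]
```

Row `x` is kept (one confirmed 0, no 1), row `y` is dropped because every BE value is 0.5, and row `z` is dropped because one BE sample carries the mutation.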



















































































answered Nov 15 '18 at 14:10, edited Nov 15 '18 at 14:15
jpp













  • This is a really great and concise piece of code, exactly what I'm looking for! My only question: the 0.5 values are not a problem as long as not all columns in a given group, such as AE, are 0.5. How can I use your code to build a subset where, for example, at least one AE column equals 1, at least one HE column equals 1, and no BE column equals 1 but at least one BE column equals 0? The number of 0.5 columns doesn't matter as long as the right allocation of 1s and 0s is fulfilled.

    – user3062260
    Nov 15 '18 at 14:43

  • For the BE condition you can use something like df[BE].ne(1).all(axis=1) & df[BE].eq(0).any(axis=1); ne stands for "not equal".

    – jpp
    Nov 15 '18 at 14:45

  • This is perfect! Thanks for your help!

    – user3062260
    Nov 16 '18 at 11:47
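
The relaxed BE condition discussed in these comments (no BE column equals 1, at least one BE column equals 0, any number of 0.5 values) can be sketched as follows. The toy dataframe `toy_be` and the mask names are hypothetical and only illustrate the idea; the BE column list is the one from the question.

```python
import pandas as pd

BE = ["X161724BE", "BE1.2", "BE1.8", "BE2", "BE9.13"]

# Hypothetical toy data covering the four interesting cases.
toy_be = pd.DataFrame(
    [
        [0.0, 0.5, 0.5, 0.5, 0.5],  # no 1, one real 0    -> keep
        [0.5, 0.5, 0.5, 0.5, 0.5],  # no data at all      -> drop
        [0.0, 1.0, 0.0, 0.0, 0.0],  # a mutation is found -> drop
        [0.0, 0.0, 0.0, 0.0, 0.0],  # all 0               -> keep
    ],
    index=["v1", "v2", "v3", "v4"],
    columns=BE,
)

no_mutation = toy_be.ne(1.0).all(axis=1)  # every BE value differs from 1
some_data = toy_be.eq(0.0).any(axis=1)    # at least one BE value is a real 0
be_absent = no_mutation & some_data
```

The first mask uses all because "no BE column equals 1" has to hold for every column; with any it would only require a single column to differ from 1. Swapping be_absent in for the stricter "all zeros" mask keeps rows such as v1, where most BE samples have no data.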


















