How to split a column into two columns by first and last found pattern in Pandas (Python 3.x)
I have a problem splitting a column into two columns. I want to split the column at the first and last occurrence of the pattern '-'. Maybe this is trivial.
Here is my column:
col1
0 aa-bb-cc-dd
1 aa-bb-cc
2 aa-bb-cc
3 aa-bb-cc-dd
This is the frame I want as a result:
col1 col2
0 bb-cc dd
1 bb cc
2 bb cc
3 bb-cc dd
Thanks in advance!
python string pandas dataframe split
asked Nov 16 '18 at 11:26 by Michael Gann; edited Nov 16 '18 at 12:05 by jpp
5 Answers
You can use a list comprehension:
import pandas as pd

# Drop everything up to the first '-', then split once on the last '-'
df = pd.DataFrame([i.split('-', 1)[1].rsplit('-', 1) for i in df['col1']],
                  columns=['col1', 'col2'])
print(df)
col1 col2
0 bb-cc dd
1 bb cc
2 bb cc
3 bb-cc dd
Pandas str methods exist primarily for convenience. For clean data, you may find the list comprehension more efficient for larger dataframes.
answered Nov 16 '18 at 11:37 by jpp
This solution takes 1.1 seconds for 700k rows--the fastest I've tested so far! 3x faster than using Series.str.split(). – John Zwinck, Nov 16 '18 at 11:47
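The answer above notes that the list comprehension suits clean data. For data that may contain rows with fewer than two '-' separators, one possible guard is sketched below. The helper name split_first_last and its choice to return (None, None) for malformed rows are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd


def split_first_last(value, sep='-'):
    """Return (middle, last) for 'first-middle-last'.

    Hypothetical helper: rows that are not strings or that contain fewer
    than two separators yield (None, None) instead of raising.
    """
    if not isinstance(value, str) or value.count(sep) < 2:
        return (None, None)
    rest = value.split(sep, 1)[1]        # drop everything up to the first sep
    middle, last = rest.rsplit(sep, 1)   # split once on the last sep
    return (middle, last)


df = pd.DataFrame({'col1': ['aa-bb-cc-dd', 'aa-bb-cc', 'aa-bb-cc', 'aa-bb-cc-dd']})

# Build the result from plain Python tuples, keeping the original index
result = pd.DataFrame([split_first_last(v) for v in df['col1']],
                      columns=['col1', 'col2'], index=df.index)
print(result)
```

Keeping the original index makes it easier to join the result back onto other columns of the source frame.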
If I understand your question correctly, you need to drop the first block delimited by '-', then split off the last '-'-delimited block into col2. If that is what you need, you could consider this:
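import pandas as pd

df = pd.DataFrame({'col1': ['aa-bb-cc-dd', 'aa-bb-cc', 'aa-bb-cc', 'aa-bb-cc-dd']})
# col2: everything after the last '-'
df['col2'] = df['col1'].apply(lambda x: x[x.rfind('-') + 1:])
# col1: everything between the first and the last '-'
df['col1'] = df['col1'].apply(lambda x: x[x.find('-') + 1:x.rfind('-')])
print(df)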
answered Nov 16 '18 at 11:32 by Matina G
apply() is usually quite slow. If we want to iterate we can just write for loops. – John Zwinck, Nov 16 '18 at 11:37
This solution takes 1.8 seconds for 700k rows. – John Zwinck, Nov 16 '18 at 11:42
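The comment about writing plain for loops instead of apply() could look roughly like the sketch below. It uses the same find/rfind slicing as the answer; the variable names middles, lasts and out are illustrative, and no claim is made here about how its speed compares with the timings quoted in this thread.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['aa-bb-cc-dd', 'aa-bb-cc', 'aa-bb-cc', 'aa-bb-cc-dd']})

middles, lasts = [], []
for value in df['col1']:
    first = value.find('-')    # position of the first '-'
    last = value.rfind('-')    # position of the last '-'
    middles.append(value[first + 1:last])
    lasts.append(value[last + 1:])

# Assemble the two lists into a new frame aligned with the original index
out = pd.DataFrame({'col1': middles, 'col2': lasts}, index=df.index)
print(out)
```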
First slice, then use str.rsplit and rename:
df = df.col1.str[3:].str.rsplit('-', n=1, expand=True).rename(columns={0:'col1',1:'col2'})
print(df)
col1 col2
0 bb-cc dd
1 bb cc
2 bb cc
3 bb-cc dd
answered Nov 16 '18 at 11:34 by Sandeep Kadapa; edited Nov 16 '18 at 11:37
dict(zip([0,1],['col1','col2'])) is just {0: 'col1', 1: 'col2'}. – John Zwinck, Nov 16 '18 at 11:35
@JohnZwinck I thought of creating the dict dynamically, but with only 2 columns here that's not needed. Thank you. – Sandeep Kadapa, Nov 16 '18 at 11:36
If you needed to do it dynamically, zip([0,1],['col1','col2']) is better written as enumerate(['col1', 'col2']). – John Zwinck, Nov 16 '18 at 11:38
@JohnZwinck It's good, I thought to use range but I will use the enumerate trick from now on. – Sandeep Kadapa, Nov 16 '18 at 11:41
This solution takes 3.9 seconds for 700k rows. – John Zwinck, Nov 16 '18 at 11:43
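The enumerate suggestion from the comments, combined with a prefix drop that does not depend on the prefix being exactly two characters, might be sketched as follows. The sample data with prefixes of varying length and the names list are assumptions added for illustration; the answer itself slices with str[3:], which relies on a fixed-width prefix.

```python
import pandas as pd

# Assumption: prefixes of different lengths, unlike the question's data
df = pd.DataFrame({'col1': ['aa-bb-cc-dd', 'a-bb-cc', 'aaaa-bb-cc', 'aa-bb-cc-dd']})

names = ['col1', 'col2']
mapping = dict(enumerate(names))   # {0: 'col1', 1: 'col2'}

out = (df['col1']
       .str.split('-', n=1).str[1]           # drop the prefix, whatever its length
       .str.rsplit('-', n=1, expand=True)    # split once on the last '-'
       .rename(columns=mapping))
print(out)
```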
Here's an idiomatic but slow way to do it:
df.col1 = df.col1.str.split('-', n=1).str[1]  # discard first part
parts = df.col1.str.rsplit('-', n=1).str
df.col1 = parts[0]
df['col2'] = parts[1]
While this works, it is not fast: about 4 seconds for 700k rows. Looking at it you'd think this is a good way to do it, but performance-wise it's worse than all the alternatives.
answered Nov 16 '18 at 11:35 by John Zwinck; edited Nov 16 '18 at 11:52
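To reproduce the kind of comparison quoted in this thread, a rough timing harness could look like the sketch below. The row count, the synthetic data and the function names are assumptions chosen to keep the run short; absolute timings depend on the machine and the pandas version, so none are shown.

```python
import timeit

import pandas as pd

N_ROWS = 50_000  # assumption: far fewer than the 700k rows mentioned above
df = pd.DataFrame({'col1': ['aa-bb-cc-dd', 'aa-bb-cc'] * (N_ROWS // 2)})


def with_str_accessor(frame):
    """Split using the pandas .str accessor."""
    rest = frame['col1'].str.split('-', n=1).str[1]
    out = rest.str.rsplit('-', n=1, expand=True)
    out.columns = ['col1', 'col2']
    return out


def with_list_comprehension(frame):
    """Split using plain Python string methods."""
    rows = [s.split('-', 1)[1].rsplit('-', 1) for s in frame['col1']]
    return pd.DataFrame(rows, columns=['col1', 'col2'])


for func in (with_str_accessor, with_list_comprehension):
    # Average over three runs of the whole transformation
    seconds = timeit.timeit(lambda: func(df), number=3) / 3
    print(f'{func.__name__}: {seconds:.3f} s per run')
```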
This might help:
df['col2'] = df['col1'].split('-')[-1]
df['col1'] = '-'.join(i for i in df['col1'].split('-')[1:-1])
answered Nov 16 '18 at 11:41 by specbug
You're missing some strs in there. – John Zwinck, Nov 16 '18 at 11:49
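As the comment points out, a pandas Series has no split method of its own, so the snippet in this answer raises an AttributeError as written. One way to add the missing .str accessors is sketched below; the intermediate name parts is an assumption, and this is only one possible repair of the idea.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['aa-bb-cc-dd', 'aa-bb-cc', 'aa-bb-cc', 'aa-bb-cc-dd']})

parts = df['col1'].str.split('-')             # Series of lists, one list per row
df['col2'] = parts.str[-1]                    # last piece of each list
df['col1'] = parts.str[1:-1].str.join('-')    # middle pieces, re-joined with '-'
print(df)
```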