Pandas - Pipe Delimiter exists in Field Value causing Bad Lines
I am using Pandas to import a text file like so:
data = pd.read_csv('filepath.txt', sep='|', quoting=3,
                   error_bad_lines=False, encoding='latin1', low_memory=False)
I'm getting an error on one line because a field value contains a pipe character. When pandas parses that row, it finds too many fields and flags the line as bad. The rest of the file still loads; however, that row is missing from the result.
Example rows:
4321|Test|1/2/1900
1234|Test||1/1/1900
Parsing this file should create:
4321 Test 1/2/1900
1234 Test 1/1/1900
I want to either eliminate the extra | in the second row's "Test|" value or have pandas recognize that it is part of the field, producing:
4321 Test 1/2/1900
1234 Test 1/1/1900
or this would be fine:
1234 Test| 1/1/1900
I have attempted to use converters and other quoting options (quotechar, etc.), but to no avail.
Any ideas on how to get past this? All recommendations are welcome.
Eric
python pandas
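For reference, newer pandas releases (1.4 and later) replace error_bad_lines with on_bad_lines, which, with the Python engine, accepts a callable that receives the fields of an over-long row. A minimal sketch of repairing such rows that way, assuming the file has exactly three columns and any extra pipe belongs to the middle text column:

import pandas as pd

EXPECTED_COLS = 3  # assumption: the real file has three columns

def fix_bad_line(fields):
    # Called only for rows that split into too many fields.
    extra = len(fields) - EXPECTED_COLS
    # Glue the spilled-over pieces back onto the text column.
    merged = '|'.join(fields[1:2 + extra])
    return [fields[0], merged, *fields[2 + extra:]]

data = pd.read_csv('filepath.txt', sep='|', quoting=3,
                   encoding='latin1', engine='python',
                   on_bad_lines=fix_bad_line)

Because the callable only fires on rows with too many fields, well-formed rows, including those with legitimately empty fields, pass through untouched.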
Can you guarantee this is a problem with only the first line?
– jpp
Nov 15 '18 at 16:08
Well, it's not actually the first line. It's line 116950 of the file. It will be random; however, I could potentially strip that string column out because I don't need it. Not sure how to evaluate the file before pandas consumes it.
– Eric
Nov 15 '18 at 16:20
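One way to inspect the file before pandas consumes it (a sketch; the path is the one from the question and the first line is assumed to be a header with the correct number of fields) is to compare each line's field count against the header's:

# Report the line numbers whose field count differs from the header's.
with open('filepath.txt', encoding='latin1') as f:
    expected = f.readline().count('|') + 1          # fields in the header row
    for lineno, line in enumerate(f, start=2):
        found = line.rstrip('\n').count('|') + 1    # fields in this row
        if found != expected:
            print(f'line {lineno}: {found} fields (expected {expected})')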
1 Answer
I think the easiest way would be to skip any line containing "||" and then hand the rest to pandas. For example:

import pandas as pd
from io import StringIO

buffer = StringIO()
with open(r'filepath.txt', 'r') as f:
    for line in f.readlines():
        if "||" not in line:        # drop any line containing a double pipe
            buffer.write(line)
buffer.seek(0)

data = pd.read_csv(buffer, sep='|', quoting=3,
                   error_bad_lines=False, encoding='latin1', low_memory=False)

You could also do it outside of Python with a find-and-replace operation.
I considered this as an option. Unfortunately, NULL and non-NULL values exist in the same column within the file, so this option would generate the same issue, just on different rows.
– Eric
Nov 15 '18 at 18:32
Are you okay with dropping every row that contains a null value?
– krflol
Nov 15 '18 at 19:03
Absolutely. I'm good with just deleting the whole column, if possible. I'd just like to do it programmatically instead of manually. I'm willing to accept any possible solution.
– Eric
Nov 15 '18 at 21:13
edited. Should remove any line with '||'
– krflol
Nov 15 '18 at 21:16
This causes my file to drop to only 2 rows, since most rows have a NULL value in at least one column and every row where '||' is found gets discarded. I think I may just have to request that the file delimiter be changed.
– Eric
Nov 16 '18 at 15:43
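A variation on the approach above that keeps rows with legitimate empty fields: rather than skipping every line containing '||', only rewrite the lines whose field count exceeds the header's, dropping the stray pipe from the text value (the first of the two outcomes the question says is acceptable). This is a sketch under assumed conditions: the first line is a header, and any extra pipe belongs to the second column.

import pandas as pd
from io import StringIO

buffer = StringIO()
with open('filepath.txt', 'r', encoding='latin1') as f:
    header = f.readline()
    expected = header.count('|') + 1        # field count the header defines
    buffer.write(header)
    for line in f:
        fields = line.rstrip('\n').split('|')
        extra = len(fields) - expected
        if extra > 0:
            # Drop the stray pipe(s): concatenate the spilled pieces of the text column.
            fields = [fields[0], ''.join(fields[1:2 + extra]), *fields[2 + extra:]]
            line = '|'.join(fields) + '\n'
        buffer.write(line)
buffer.seek(0)

data = pd.read_csv(buffer, sep='|', quoting=3, low_memory=False)

Rows with the expected number of fields, including rows with empty values, are written through unchanged.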