Extracting relevant data from a txt file
I know how to extract data from a .txt file if it has a certain format (columns with certain spacing) using numpy.loadtxt, but I'm currently facing a slightly more complicated problem. Let's say I have data in the following format:
*** model xy ***
date: 11.14.18 gate time: 190 sec
enviroment Ug= 483 counts time: 09:19:55
enviroment Ug= 777 counts time: 09:21:55
enviroment Ug= 854 counts time: 09:53:55
.
.
.
The relevant information for me is the counts and the gate time. I know I can use open("some txt file", "r") to read in a txt file, but I don't know how to remove the useless information from each line.
python
asked Nov 14 '18 at 14:17 by Sito
Possible duplicate of How to efficiently parse fixed width files?
– cha0site
Nov 14 '18 at 14:23
Is the gate time only in one line, or are all of the times gate times too?
– Muhammad Ahmad
Nov 14 '18 at 14:29
@MuhammadAhmad The gate time is only in the first line; the other times are the moments when the measurement finished and are irrelevant for me.
– Sito
Nov 14 '18 at 14:30
3 Answers
You can simply read all of the text from the file at once and find the required data with a regex:
import re
with open("some txt file", "r") as fin:
    all_text = fin.read()
# Find the gate time
gate_time_r = re.compile(r'gate\s+time:\s+(\d+)', re.IGNORECASE)
gate_time = int(gate_time_r.search(all_text).groups()[0])
# Find the counts
counts_r = re.compile(r'enviroment\s+ug=\s+(\d+)', re.IGNORECASE)
counts_list = list(map(int, counts_r.findall(all_text)))
Gate time regex: gate\s+time:\s+(\d+) matches a number that follows the string gate time: and captures that number in a group. Run it with gate_time_r.search(all_text); it finds the first match, and you can pick its first group.
Counts regex: enviroment\s+ug=\s+(\d+) matches a number that follows enviroment ug= (case-insensitively, so Ug= matches too) and captures that number in a group.
As there is more than one match for this in the all_text string, you can use findall to search for all of them.
With a single group in the pattern, findall returns a list of that group's matches as strings, so it is the list of the actual counts. Convert the entries to int if you want numbers, as the code above does.
answered Nov 14 '18 at 14:33 by Muhammad Ahmad (edited Nov 14 '18 at 14:42)
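The regex approach can be wrapped in a small function that also handles a missing gate time. This is only a sketch: the helper name parse_measurements and the inline SAMPLE string (copied from the question) are assumptions for illustration, not part of the answer.

```python
import re

# Sample text copied from the question; a real script would read it from a file.
SAMPLE = """*** model xy ***
date: 11.14.18 gate time: 190 sec
enviroment Ug= 483 counts time: 09:19:55
enviroment Ug= 777 counts time: 09:21:55
enviroment Ug= 854 counts time: 09:53:55
"""

GATE_TIME_RE = re.compile(r"gate\s+time:\s+(\d+)", re.IGNORECASE)
COUNTS_RE = re.compile(r"enviroment\s+ug=\s+(\d+)", re.IGNORECASE)


def parse_measurements(text):
    """Return (gate_time, counts) extracted from text (hypothetical helper)."""
    match = GATE_TIME_RE.search(text)
    if match is None:
        # search() returns None when nothing matches, so fail with a clear message.
        raise ValueError("no 'gate time:' entry found")
    gate_time = int(match.group(1))
    # findall() returns the captured digits as strings; convert each to int.
    counts = [int(value) for value in COUNTS_RE.findall(text)]
    return gate_time, counts


gate_time, counts = parse_measurements(SAMPLE)
print(gate_time, counts)  # 190 [483, 777, 854]
```

Checking for None before reading the group avoids an AttributeError on files that lack the header line.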
You need to read the txt file line by line; you can use readlines() for that purpose. For each line from index 2 onward (the third line, where the measurements start), you can split the string:
"enviroment Ug= 483 counts time: 09:19:55".split()
This results in
['enviroment', 'Ug=', '483', 'counts', 'time:', '09:19:55']
You can access the [2] element (the counts) and the [-1] element (the timestamp) to get the information that you need.
answered Nov 14 '18 at 14:24 by İhsan Cemil Çiçek
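The line-by-line approach might look like the sketch below. It assumes the layout shown in the question (a title line, then the date/gate-time line, then the measurements); the helper name parse_lines and the hard-coded sample_lines list are hypothetical, standing in for the result of readlines().

```python
def parse_lines(lines):
    """Return (gate_time, counts) from a list of lines (hypothetical helper)."""
    # Second line looks like: "date: 11.14.18 gate time: 190 sec"
    # so after split() the gate time is the fifth field (index 4).
    gate_time = int(lines[1].split()[4])

    counts = []
    for line in lines[2:]:  # skip the title line and the date/gate-time line
        fields = line.split()
        # Measurement lines have a numeric third field; anything else
        # (blank lines, "." placeholders) is skipped.
        if len(fields) >= 3 and fields[2].isdigit():
            counts.append(int(fields[2]))
    return gate_time, counts


# Stand-in for: with open(path) as f: sample_lines = f.readlines()
sample_lines = [
    "*** model xy ***\n",
    "date: 11.14.18 gate time: 190 sec\n",
    "enviroment Ug= 483 counts time: 09:19:55\n",
    "enviroment Ug= 777 counts time: 09:21:55\n",
    "enviroment Ug= 854 counts time: 09:53:55\n",
]
print(parse_lines(sample_lines))  # (190, [483, 777, 854])
```

Indexing by field position is brittle if the header wording changes, so this only suits files that keep exactly this layout.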
Try using pandas for this.
Assuming your file is a fixed-width file with the first record as the header, you can do the following (with pandas imported as pd):
In [1961]: df = pd.read_fwf('t.txt')
In [1962]: df
Out[1962]:
   date: 11.14.18  Unnamed: 1 Unnamed: 2  gate time: 190  sec
0  enviroment Ug=         483     counts  time: 09:19:55  NaN
1  enviroment Ug=         777     counts  time: 09:21:55  NaN
2  enviroment Ug=         854     counts  time: 09:53:55  NaN
In [1963]: df.columns
Out[1963]:
Index([u'date: 11.14.18', u'Unnamed: 1', u'Unnamed: 2', u'gate time: 190',
       u'sec'],
      dtype='object')
# The above gives you the column names.
# You can see in `df` that the counts lie in the 'Unnamed: 1' column. The gate time (190) appears only in the column name 'gate time: 190'; the values in that column are the measurement timestamps.
So, just extract those columns from the dataframe (df):
In [1967]: df[['Unnamed: 1', 'gate time: 190']]
Out[1967]:
   Unnamed: 1  gate time: 190
0         483  time: 09:19:55
1         777  time: 09:21:55
2         854  time: 09:53:55
Now you can write the above to a csv file:
In [1968]: df.to_csv('/home/mayankp/Desktop/tt.csv', header=False, index=False, columns=['Unnamed: 1', 'gate time: 190'])
This approach saves you from writing for loops and complex regexes.
answered Nov 14 '18 at 14:27 by Mayank Porwal (edited Nov 14 '18 at 14:40)
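If pandas is not available, the CSV export step can also be sketched with the standard library's csv module. This is a swapped-in alternative, not the answer's method: the values are hard-coded from the sample in the question, and the column names are hypothetical.

```python
import csv
import io

gate_time = 190            # value from the header line of the sample
counts = [483, 777, 854]   # counts from the sample measurement lines

# Write to an in-memory buffer here; to write to disk, pass a file opened
# with open(path, "w", newline="") instead.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["counts", "gate_time_sec"])  # hypothetical column names
for value in counts:
    # The gate time is constant for the file, so it is repeated on every row.
    writer.writerow([value, gate_time])

csv_text = buffer.getvalue()
print(csv_text)
```

Storing the gate time next to each count keeps the file self-describing, at the cost of repeating one value per row.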