Extracting relevant data from a txt file

I know how to extract data from a .txt file if it has a regular format (columns with fixed spacing) using numpy.loadtxt, but I'm currently facing a slightly more complicated problem. Let's say I have data in the following format:

*** model xy ***    

    date: 11.14.18                         gate time: 190 sec

    enviroment Ug=    483 counts        time: 09:19:55

    enviroment Ug=    777 counts        time: 09:21:55

    enviroment Ug=    854 counts        time: 09:53:55

                          .

                          .

                          .

The relevant information for me is the counts and the gate time. I know I can use open("some txt file", "r") to read in a txt file, but I don't know how to strip the irrelevant information from each line.

asked Nov 14 '18 at 14:17

Sito

192212

Possible duplicate of How to efficiently parse fixed width files?

– cha0site
Nov 14 '18 at 14:23

Is gate time only in one line? or all of the times are gate times too?

– Muhammad Ahmad
Nov 14 '18 at 14:29

@MuhammadAhmad gate time is only in the first line; the other times are the moments when each measurement finished and are irrelevant for me.

– Sito
Nov 14 '18 at 14:30

python


3 Answers

You can simply read all of the text from the file at once and find the required data with a regex:

import re

with open("some txt file", "r") as fin:
    all_text = fin.read()

# Find the gate time (the first number after "gate time:")
gate_time_r = re.compile(r'gate\s+time:\s+(\d+)', re.IGNORECASE)
gate_time = int(gate_time_r.search(all_text).group(1))

# Find the counts (every number after "enviroment Ug=")
counts_r = re.compile(r'enviroment\s+ug=\s+(\d+)', re.IGNORECASE)
counts_list = list(map(int, counts_r.findall(all_text)))

Gate time regex: gate\s+time:\s+(\d+) matches a number that follows the string gate time: and captures that number in a group. Run it with gate_time_r.search(all_text); it finds the first match, and you can pick its first group.

Counts regex: enviroment\s+ug=\s+(\d+) matches a number that follows enviroment ug= and captures that number in a group. (The spelling enviroment is deliberate: it matches the text in the file.)

As there is more than one match for this in the all_text string, you can use findall to search for all of the matches.

Because the regex contains a single group, findall returns a list of the captured strings, that is, the counts themselves. Cast them to int if you want numbers.
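The regex approach can be tried without a file by running it on the sample text from the question. The following is only a minimal sketch: the helper name extract and the decision to raise ValueError when no gate time is found are assumptions, not part of the original answer.

```python
import re

# Sample text copied from the question; in practice this would come from fin.read().
SAMPLE = """*** model xy ***
    date: 11.14.18                         gate time: 190 sec
    enviroment Ug=    483 counts        time: 09:19:55
    enviroment Ug=    777 counts        time: 09:21:55
    enviroment Ug=    854 counts        time: 09:53:55
"""

GATE_TIME_RE = re.compile(r"gate\s+time:\s+(\d+)", re.IGNORECASE)
COUNTS_RE = re.compile(r"enviroment\s+ug=\s+(\d+)", re.IGNORECASE)


def extract(text):
    """Return (gate_time, counts) parsed from text (hypothetical helper)."""
    match = GATE_TIME_RE.search(text)
    if match is None:
        # search() returns None when nothing matches, so fail with a clear message.
        raise ValueError("no 'gate time:' field found")
    gate_time = int(match.group(1))
    counts = [int(c) for c in COUNTS_RE.findall(text)]
    return gate_time, counts


gate_time, counts = extract(SAMPLE)
print(gate_time, counts)  # 190 [483, 777, 854]
```

Checking the result of search() before calling group() avoids an AttributeError on files that lack the gate time line.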

edited Nov 14 '18 at 14:42

answered Nov 14 '18 at 14:33

Muhammad Ahmad

2,1321422


You need to read the txt file line by line; you can use readlines() for that purpose. For each data line (the lines after the header lines), you can split the string:

"enviroment Ug=    483 counts        time: 09:19:55".split()

This results in

['enviroment', 'Ug=', '483', 'counts', 'time:', '09:19:55']

You can access the [2] and [-1] elements to get the information that you need (the count and the time).
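Put together, the line-by-line idea might look like the sketch below. The helper name parse_counts and the rule for recognising data lines (first field equal to enviroment) are assumptions made for illustration; io.StringIO stands in for an opened file.

```python
import io


def parse_counts(lines):
    """Collect the count from every line whose first field is 'enviroment' (hypothetical helper)."""
    counts = []
    for line in lines:
        fields = line.split()
        # Data lines look like: enviroment Ug= 483 counts time: 09:19:55
        if len(fields) >= 3 and fields[0] == "enviroment":
            counts.append(int(fields[2]))
    return counts


# io.StringIO stands in for an opened file so the sketch is self-contained.
sample = io.StringIO(
    "*** model xy ***\n"
    "    date: 11.14.18                         gate time: 190 sec\n"
    "    enviroment Ug=    483 counts        time: 09:19:55\n"
    "    enviroment Ug=    777 counts        time: 09:21:55\n"
)
print(parse_counts(sample))  # [483, 777]
```

Filtering on the first field instead of skipping a fixed number of rows keeps the loop working if the number of header lines changes.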

answered Nov 14 '18 at 14:24

İhsan Cemil Çiçek

13111


Try using pandas for this.

Assuming your file is a fixed-width file with the first record as the header (and pandas imported as pd), you can do the following:

In [1961]: df = pd.read_fwf('t.txt')



In [1962]: df

Out[1962]: 

   date: 11.14.18  Unnamed: 1 Unnamed: 2  gate time: 190  sec

0  enviroment Ug=         483     counts  time: 09:19:55  NaN

1  enviroment Ug=         777     counts  time: 09:21:55  NaN

2  enviroment Ug=         854     counts  time: 09:53:55  NaN



In [1963]: df.columns

Out[1963]: 

Index([u'date: 11.14.18', u'Unnamed: 1', u'Unnamed: 2', u'gate time: 190',

       u'sec'],

      dtype='object')



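# The above gives you the column names.

# You can see in `df` that the counts lie in the column 'Unnamed: 1'. Note that the gate time value (190) only appears in the header name 'gate time: 190'; that column itself holds the per-measurement times.

So, just extract those columns from the dataframe (df):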

In [1967]: df[['Unnamed: 1', 'gate time: 190']]

Out[1967]: 

   Unnamed: 1  gate time: 190

0         483  time: 09:19:55

1         777  time: 09:21:55

2         854  time: 09:53:55

Now you can write the above to a CSV file:

In [1968]: df.to_csv('/home/mayankp/Desktop/tt.csv', header=False, index=False, columns=['Unnamed: 1', 'gate time: 190'])

This approach basically saves you from using for loops and complex regex.
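If the counts and the gate time have already been extracted as plain Python values, the CSV step can also be done with the standard csv module. This is only a sketch under assumptions: the helper name write_counts_csv and the column layout (one row per count, with the gate time repeated) are hypothetical, and io.StringIO stands in for a real output file.

```python
import csv
import io


def write_counts_csv(out, gate_time, counts):
    """Write one row per measurement: count, gate time in seconds (hypothetical layout)."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["counts", "gate_time_sec"])
    for count in counts:
        writer.writerow([count, gate_time])


# io.StringIO stands in for open("counts.csv", "w", newline="") here.
buffer = io.StringIO()
write_counts_csv(buffer, 190, [483, 777, 854])
print(buffer.getvalue())
```

Repeating the gate time on every row keeps each row self-describing, which can be convenient when files with different gate times are later combined.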

edited Nov 14 '18 at 14:40

answered Nov 14 '18 at 14:27

Mayank Porwal

4,9352724


