Extracting relevant data from a txt file












I know how to extract data from a .txt file that has a regular format (columns with fixed spacing) using numpy.loadtxt, but I'm currently facing a slightly more complicated problem. Let's say I have data in the following format:



*** model xy ***    
date: 11.14.18 gate time: 190 sec
enviroment Ug= 483 counts time: 09:19:55
enviroment Ug= 777 counts time: 09:21:55
enviroment Ug= 854 counts time: 09:53:55
.
.
.


The relevant information for me is the counts and the gate time. I know I can use open("some txt file", "r") to read in a txt file, but I don't know how to strip the irrelevant information from each line.







python
















asked Nov 14 '18 at 14:17

Sito













  • Possible duplicate of How to efficiently parse fixed width files?

    – cha0site
    Nov 14 '18 at 14:23











  • Is the gate time only in one line, or are all of the times gate times too?

    – Muhammad Ahmad
    Nov 14 '18 at 14:29











  • @MuhammadAhmad The gate time is only in the first line; the other times are the moments when each measurement finished and are irrelevant for me.

    – Sito
    Nov 14 '18 at 14:30































3 Answers















































































You can simply read all of the text from the file at once and find the required data with a regex:

    import re

    with open("some txt file", "r") as fin:
        all_text = fin.read()

    # Find the gate time
    gate_time_r = re.compile(r'gate\s+time:\s+(\d+)', re.IGNORECASE)
    gate_time = int(gate_time_r.search(all_text).groups()[0])

    # Find the counts
    counts_r = re.compile(r'enviroment\s+ug=\s+(\d+)', re.IGNORECASE)
    counts_list = list(map(int, counts_r.findall(all_text)))

Gate time regex: gate\s+time:\s+(\d+) matches a number that follows the string gate time: and captures that number in a group. Run it with gate_time_r.search(all_text); it returns the first match, and you can pick its first group.

Counts regex: enviroment\s+ug=\s+(\d+). It matches a number that follows enviroment ug= and captures that number in a group. The spelling "enviroment" is the one used in the data file, and re.IGNORECASE lets ug= match Ug=.

As there is more than one match for this pattern in the all_text string, you can use findall to search for all of the matches.

Because the regex contains a single group, findall returns a list of that group's matches, i.e. the list of counts as strings. Cast them to int if you want numbers.
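As a rough end-to-end sketch of the regex approach above, the snippet below runs the two patterns against the sample from the question held in a string, so it works without a file on disk. The helper name parse_measurements and the ValueError raised for a missing gate time are assumptions added for illustration, not part of the original answer.

```python
import re

# Sample text copied from the question (the spelling "enviroment" is in the data).
SAMPLE = """*** model xy ***
date: 11.14.18 gate time: 190 sec
enviroment Ug= 483 counts time: 09:19:55
enviroment Ug= 777 counts time: 09:21:55
enviroment Ug= 854 counts time: 09:53:55
"""

GATE_TIME_RE = re.compile(r"gate\s+time:\s+(\d+)", re.IGNORECASE)
COUNTS_RE = re.compile(r"enviroment\s+ug=\s+(\d+)", re.IGNORECASE)


def parse_measurements(text):
    """Hypothetical helper: return (gate_time, counts) parsed from the text."""
    match = GATE_TIME_RE.search(text)
    if match is None:
        # search() returns None when nothing matches, so fail with a clear message.
        raise ValueError("no 'gate time:' entry found")
    gate_time = int(match.group(1))
    # With a single group, findall returns the captured strings directly.
    counts = [int(value) for value in COUNTS_RE.findall(text)]
    return gate_time, counts


gate_time, counts = parse_measurements(SAMPLE)
print(gate_time, counts)  # 190 [483, 777, 854]
```

For a real file, the text could be read with open(...).read() as in the answer and passed to the same helper.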















edited Nov 14 '18 at 14:42

answered Nov 14 '18 at 14:33

Muhammad Ahmad






































You need to read the file line by line; you can use readlines() for that. Each measurement line (the lines that start with enviroment) can be split on whitespace:

    "enviroment Ug=    483 counts        time: 09:19:55".split()

This results in

    ['enviroment', 'Ug=', '483', 'counts', 'time:', '09:19:55']

You can then access element [2] for the counts and element [-1] for the measurement time.
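A minimal sketch of the split-based approach, extended to pick up the gate time from the header line as well. The helper name parse_lines and the rule used to find the gate time (the token two places after "gate") are assumptions for illustration; they rely on the header reading "... gate time: 190 sec" as in the question.

```python
# Sample lines in the layout shown in the question and in the split() example.
SAMPLE_LINES = [
    "*** model xy ***",
    "date: 11.14.18 gate time: 190 sec",
    "enviroment Ug=    483 counts        time: 09:19:55",
    "enviroment Ug=    777 counts        time: 09:21:55",
    "enviroment Ug=    854 counts        time: 09:53:55",
]


def parse_lines(lines):
    """Hypothetical helper: collect the gate time and the counts from an iterable of lines."""
    gate_time = None
    counts = []
    for line in lines:
        tokens = line.split()  # split on any run of whitespace
        if gate_time is None and "gate" in tokens:
            # Header tokens: ['date:', '11.14.18', 'gate', 'time:', '190', 'sec']
            gate_time = int(tokens[tokens.index("gate") + 2])
        elif tokens and tokens[0] == "enviroment":
            # Measurement tokens: ['enviroment', 'Ug=', '483', 'counts', 'time:', '09:19:55']
            counts.append(int(tokens[2]))
    return gate_time, counts


print(parse_lines(SAMPLE_LINES))  # (190, [483, 777, 854])
```

Since a file object yields its lines when iterated, the same helper could be called as parse_lines(fin) inside a with open(...) block instead of calling readlines() first.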

















answered Nov 14 '18 at 14:24

İhsan Cemil Çiçek






































Try using pandas for this:

Assuming your file is a fixed-width file with the first record as the header, you can do the following:

    # Requires: import pandas as pd
    In [1961]: df = pd.read_fwf('t.txt')

    In [1962]: df
    Out[1962]:
       date: 11.14.18  Unnamed: 1  Unnamed: 2  gate time: 190  sec
    0  enviroment Ug=         483      counts  time: 09:19:55  NaN
    1  enviroment Ug=         777      counts  time: 09:21:55  NaN
    2  enviroment Ug=         854      counts  time: 09:53:55  NaN

    In [1963]: df.columns
    Out[1963]:
    Index([u'date: 11.14.18', u'Unnamed: 1', u'Unnamed: 2', u'gate time: 190',
           u'sec'],
          dtype='object')

    # The above gives you the column names.
    # You can see in `df` that the counts values and the time values lie in individual columns.
    # Note that the gate time itself (190) appears only in the column name 'gate time: 190'.

So, just extract those columns from the dataframe (df):

    In [1967]: df[['Unnamed: 1', 'gate time: 190']]
    Out[1967]:
       Unnamed: 1  gate time: 190
    0         483  time: 09:19:55
    1         777  time: 09:21:55
    2         854  time: 09:53:55

Now you can write the above to a CSV file:

    In [1968]: df.to_csv('/home/mayankp/Desktop/tt.csv', header=False, index=False, columns=['Unnamed: 1', 'gate time: 190'])

This approach saves you from using for loops and complex regexes.
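For comparison, here is a small standard-library sketch that writes a similar CSV without pandas, pairing each counts value with the gate time taken from the header line. It is an alternative technique, not the pandas method above; the function name counts_to_csv and the column names counts and gate_time_sec are made up for this example.

```python
import csv
import io
import re

# Sample text copied from the question (the spelling "enviroment" is in the data).
SAMPLE = """*** model xy ***
date: 11.14.18 gate time: 190 sec
enviroment Ug= 483 counts time: 09:19:55
enviroment Ug= 777 counts time: 09:21:55
enviroment Ug= 854 counts time: 09:53:55
"""


def counts_to_csv(text, out):
    """Hypothetical helper: write one CSV row per counts value to the file-like object out."""
    gate = re.search(r"gate\s+time:\s+(\d+)", text)
    gate_time = int(gate.group(1)) if gate else None
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["counts", "gate_time_sec"])  # assumed column names
    for value in re.findall(r"Ug=\s*(\d+)", text):
        writer.writerow([int(value), gate_time])


buffer = io.StringIO()
counts_to_csv(SAMPLE, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, out could be a file opened with open(path, "w", newline="").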















edited Nov 14 '18 at 14:40

answered Nov 14 '18 at 14:27

Mayank Porwal





























