Python - Substituting regex matches in byte file












I'm struggling to automate a text file clean-up for some subsequent data analysis. I have a tab-delimited text file where I need to remove instances of \t" (remove the " but keep the tab).

I then want to remove instances of \n where the character before is not \r, i.e. \r\n is OK, x\n is not. I have the first part working but not the second part; any help appreciated. I appreciate there are probably far better ways to do this, given that I'm writing and then reopening the file in byte mode simply because I can't seem to detect \r in 'r' mode.



import re
import sys
import time

originalFile = '14-09 - Copy.txt'
amendedFile = '14-09 - amended.txt'

# First pass (text mode): drop the quote that follows a tab, keep the tab.
with open(originalFile, 'r') as content_file:
    content = content_file.read()

content = content.replace('\t"', '\t')

with open(amendedFile, 'w') as f:
    f.write(content)

# Second pass (binary mode): remove any \n not preceded by \r.
with open(amendedFile, 'rb') as content_file:
    content = content_file.read()
# With a bytes pattern the replacement must be bytes as well.
content = re.sub(b"(?<!\r)\n", b"", content)

with open(amendedFile, 'wb') as f:
    f.write(content)

print("Done")
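One possible way to avoid the intermediate text-mode write is to do both substitutions in a single binary pass, so that Python never gets a chance to translate line endings. The following is only a minimal sketch, assuming Python 3 and a file small enough to read into memory; the helper names clean_bytes and clean_file are hypothetical and not part of the original code.

```python
import re


def clean_bytes(data: bytes) -> bytes:
    """Apply both clean-up steps to raw bytes, with no newline translation."""
    # Step 1: drop a double quote that directly follows a tab, keeping the tab.
    data = data.replace(b'\t"', b'\t')
    # Step 2: drop every LF that is not immediately preceded by CR.
    return re.sub(rb'(?<!\r)\n', b'', data)


def clean_file(src: str, dst: str) -> None:
    """Read src in binary mode, clean it, and write the result to dst."""
    with open(src, 'rb') as f:
        data = f.read()
    with open(dst, 'wb') as f:
        f.write(clean_bytes(data))


if __name__ == "__main__":
    sample = b'a\t"b\r\nc\t\nd\ne\r\n'
    print(clean_bytes(sample))
```

Because both files are opened with 'rb' and 'wb', the \r\n pairs already in the data are written back exactly as they were read.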


For clarity and completeness, the Python 2 code below identifies the positions I'm interested in (I'm just looking to automate their removal now), i.e.

\r\nText should equal \r\nText

\t\nText should equal \tText

Text\nText should equal TextText



import re
import sys
import time

with open('14-09 - Copy.txt', 'rb') as content_file:
    content = content_file.read()

newLinePos = [m.start() for m in re.finditer('\n', content)]

for line in newLinePos:
    if content[line-1] != '\r':
        print(repr(content[line-20:line]))
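Note that the snippet above relies on Python 2 semantics. In Python 3, re.finditer rejects a str pattern on bytes data, and indexing a bytes object returns an int, so a comparison with '\r' would never be equal. A hedged Python 3 sketch of the same check follows; find_bare_lf is a hypothetical helper name and the sample bytes are made up for illustration.

```python
import re


def find_bare_lf(data: bytes, context: int = 20):
    """Return (position, preceding bytes) for each LF not preceded by CR."""
    hits = []
    for m in re.finditer(rb'(?<!\r)\n', data):
        pos = m.start()
        # Slicing bytes yields bytes (indexing would yield an int instead).
        hits.append((pos, data[max(0, pos - context):pos]))
    return hits


if __name__ == "__main__":
    for pos, before in find_bare_lf(b'one\r\ntwo\nthree\t\nfour\r\n'):
        print(pos, repr(before))
```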


Thanks as always!










      python regex python-3.x
















asked Nov 14 '18 at 1:03
Brodie
























          1 Answer
You probably want to use ([^\r])\n as your pattern, and then substitute \1 to keep the character before.

So your line would be

content = re.sub(b"([^\r])\n", rb"\1", content)
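The capturing pattern can be compared side by side with the lookbehind pattern from the question. This is only a small sketch, assuming Python 3 and bytes input (so the replacement has to be bytes too); the sample data is invented for illustration. The capturing version needs a byte in front of the \n, so it leaves a \n at the very start of the data and the second of two consecutive \n bytes in place, whereas the lookbehind version removes those as well.

```python
import re

sample = b'\nText\n\nText\r\nText\t\nText'

# Capture the preceding byte and put it back with \1.
with_group = re.sub(rb'([^\r])\n', rb'\1', sample)

# Zero-width lookbehind: only the LF itself is matched and removed.
with_lookbehind = re.sub(rb'(?<!\r)\n', b'', sample)

print(with_group)       # b'\nText\nText\r\nText\tText'
print(with_lookbehind)  # b'TextText\r\nText\tText'
```

Which behaviour is preferable depends on whether leading or doubled \n bytes can occur in the data at all.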
















answered Nov 14 '18 at 1:35
B. Morris













• This is really close, but I'm looking to remove the character. I can see it in Notepad++ as CRLF, equivalent to \r\n (as opposed to the original CR, \r). I'm guessing that writing out via 'w' changes CR to CRLF, so I switched the code around, although I see the same behaviour. Any ideas? I.e. detect \r not followed by \n and remove the \r?
  – Brodie
  Nov 14 '18 at 5:25

• Thanks, ignore my last comment. This works perfectly provided I execute the 'rb' pass first, as otherwise it automatically writes \n as \r\n! I was executing one of the opens in the wrong order.
  – Brodie
  Nov 14 '18 at 5:41
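The behaviour described in these comments is consistent with newline translation in text mode: on Windows, writing with 'w' turns each \n into \r\n by default, and reading with 'r' turns \r\n and a lone \r into \n. The sketch below reproduces this on any platform by passing the newline argument to open() explicitly; the temporary file and sample strings are purely illustrative.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # newline='\r\n' mimics the Windows default for text-mode writes.
    with open(path, 'w', newline='\r\n') as f:
        f.write('a\nb\n')
    with open(path, 'rb') as f:
        translated = f.read()

    # newline='' switches translation off, so '\n' is written unchanged.
    with open(path, 'w', newline='') as f:
        f.write('a\nb\n')
    with open(path, 'rb') as f:
        untranslated = f.read()

    # Default text-mode reads (newline=None) map '\r\n' and '\r' to '\n'.
    with open(path, 'wb') as f:
        f.write(b'a\r\nb\rc\n')
    with open(path, 'r') as f:
        as_text = f.read()
finally:
    os.remove(path)

print(translated)     # b'a\r\nb\r\n'
print(untranslated)   # b'a\nb\n'
print(repr(as_text))  # 'a\nb\nc\n'
```

This is why \r cannot be detected after a default 'r' read, and why a text-mode write that runs after the regex pass can reintroduce \r\n.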


















