Python - Substituting regex matches in byte file

I'm struggling to automate a text file clean-up for some subsequent data analysis. I have a tab-delimited text file where I need to remove instances of \t" text (remove the " but keep the tab).

I then want to remove instances of \n where the character before is not \r, i.e. \r\n is OK but x\n is not. I have the first part working but not the second; any help appreciated. I appreciate there are probably far better ways to do this, given that I'm writing the file and then reopening it in byte mode simply because I can't seem to detect \r in 'r' mode.

import re

originalFile = '14-09 - Copy.txt'
amendedFile = '14-09 - amended.txt'

# Part 1 (working): drop the quote after a tab, keep the tab
with open(originalFile, 'r') as content_file:
    content = content_file.read()

content = content.replace('\t"', '\t')

with open(amendedFile, 'w') as f:
    f.write(content)

# Part 2 (not working): remove \n unless it is preceded by \r
with open(amendedFile, 'rb') as content_file:
    content = content_file.read()

content = re.sub(b"(?<!\r)\n", b"", content)

with open(amendedFile, 'wb') as f:
    f.write(content)

print("Done")

For clarity and completeness, the Python 2 code below identifies the positions that I'm interested in (I'm just looking to automate their removal now), i.e.:

\r\nText should equal \r\nText

\t\nText should equal \tText

Text\nText should equal TextText

import re

with open('14-09 - Copy.txt', 'rb') as content_file:
    content = content_file.read()

# Offsets of every \n in the file
newLinePos = [m.start() for m in re.finditer('\n', content)]

# Show the 20 characters leading up to each \n that is not preceded by \r
for line in newLinePos:
    if content[line-1] != '\r':
        print(repr(content[line-20:line]))

Thanks as always!

asked Nov 14 '18 at 1:03

Brodie

353

python regex python-3.x

1 Answer

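You probably want to use ([^\r])\n as your pattern, and then substitute \1 to keep the character before. Because the pattern and the file content are bytes, the replacement has to be a bytes literal as well.

So your line would be

content = re.sub(b"([^\r])\n", rb"\1", content)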
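A minimal sketch comparing this capture-group pattern with the negative lookbehind from the question, run on in-memory bytes rather than the real file. The sample data and the two helper names are made up for illustration. Both patterns handle the three cases listed in the question the same way; they should differ only on edge cases such as a \n at the very start of the data or two bare \n in a row, because the capture group needs a preceding byte to consume.

```python
import re

# Hypothetical sample covering the cases from the question.
sample = b"keep\r\nText\t\nText Text\nText"


def strip_with_group(data: bytes) -> bytes:
    # Capture the byte before "\n" and put it back. The replacement
    # is a bytes literal because the pattern and data are bytes.
    return re.sub(rb"([^\r])\n", rb"\1", data)


def strip_with_lookbehind(data: bytes) -> bytes:
    # Match only the "\n" itself, so nothing has to be put back.
    return re.sub(rb"(?<!\r)\n", b"", data)


print(strip_with_group(sample))
print(strip_with_lookbehind(sample))

# Edge case: a leading "\n" and two bare "\n" in a row.
edge = b"\nA\n\nB"
print(strip_with_group(edge))
print(strip_with_lookbehind(edge))
```

If the data can contain consecutive bare \n characters, the lookbehind form is probably the safer choice.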

answered Nov 14 '18 at 1:35

B. Morris

1718

This is really close but I'm looking to remove the char. I can see it in Notepad++ as CRLF, equivalent to \r\n (as opposed to the original CR, \r). I'm guessing that when I write out via 'w' it changes CR to CRLF, so I switched the code around, although I see the same behaviour. Any ideas? i.e. detect \r not preceded by \n and remove the \r?

– Brodie
Nov 14 '18 at 5:25
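The behaviour described in this comment matches newline translation in Python 3 text mode: by default, open(..., 'w') writes each \n as the platform line separator (\r\n on Windows), and open(..., 'r') turns \r\n and a lone \r into \n. A small sketch, assuming a throwaway temp file (the path is hypothetical), showing that passing newline="" switches the translation off in both directions:

```python
import os
import tempfile

# Hypothetical scratch file; the real file names are in the question.
path = os.path.join(tempfile.mkdtemp(), "newline_demo.txt")

# newline="" disables newline translation on write, so the bytes
# on disk are exactly what the string contains.
with open(path, "w", newline="") as f:
    f.write("a\r\nb\nc")

with open(path, "rb") as f:
    raw = f.read()

# Default text mode reads with universal newlines: "\r\n" and a
# lone "\r" both come back as "\n".
with open(path, "r") as f:
    translated = f.read()

# newline="" on read leaves the line endings untouched.
with open(path, "r", newline="") as f:
    untranslated = f.read()

print(repr(raw), repr(translated), repr(untranslated))
```

This would also explain why \r seems undetectable in plain 'r' mode: the translation has already removed it before the string reaches the regex.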

Thanks - ignore last comment. This works perfectly provided I do the 'rb' read and 'wb' write first, as otherwise the text-mode write automatically writes \n as \r\n! I was executing one of the opens in the wrong order.

– Brodie
Nov 14 '18 at 5:41
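One way to sidestep the ordering problem entirely is to do both clean-up steps on bytes, so the file is never written in text mode. A sketch under that assumption; clean and clean_file are hypothetical helper names, not from the thread, and the lookbehind pattern is the one from the question:

```python
import re


def clean(data: bytes) -> bytes:
    # Drop the quote that follows a tab, keeping the tab.
    data = data.replace(b'\t"', b"\t")
    # Remove every "\n" that is not preceded by "\r".
    return re.sub(rb"(?<!\r)\n", b"", data)


def clean_file(src: str, dst: str) -> None:
    # Binary mode on both sides, so no newline translation happens.
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(clean(data))


# File names taken from the question:
# clean_file("14-09 - Copy.txt", "14-09 - amended.txt")
```

Working on bytes assumes the tab, quote, and newline characters are single bytes in the file's encoding, which holds for ASCII-compatible encodings such as UTF-8 but not for UTF-16.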
