python - html - how to modify code by converting text outside of a tag into a tag












How to replace/convert/correct a string representing tag into a tag?



I have the example below, where I need to clean up some parts of the code and convert strings like &lt;/div&gt; into proper tags.



html = """
<html>
<body>
<div>
&lt;/div&gt; <----- how to convert the line into </div>
<div class="first_class">
<h1 id="Header_1">
Header_1
</h1>
</div>
</body>
</html>
"""


I tried



from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")

tag = soup.find(text="&lt;")
tag.replace_with("<")

print(soup.prettify())


but this logic doesn't work: the find function doesn't pick up the string. The fact that the text is outside of any tag makes it more difficult. How can this be achieved?










  • Did you try: soup.find(text="<")? The string was encoded in the original HTML, but BeautifulSoup should have decoded them when parsing and therefore used the decoded version for matching find.

    – Lie Ryan
    Nov 16 '18 at 5:50
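
As a rough check of the point made in the comment above, the standard library's HTMLParser (not BeautifulSoup itself) can show what a parser hands back for escaped text. The class name TextCollector and the short sample string are made up for this sketch:

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: records every text chunk the parser reports."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self.chunks = []

    def handle_data(self, data):
        # With convert_charrefs=True, entities arrive here already decoded.
        self.chunks.append(data)


collector = TextCollector()
collector.feed("<div>&lt;/div&gt;</div>")
collector.close()

text_nodes = collector.chunks
print(text_nodes)
```

If BeautifulSoup behaves the same way, the text node holds the literal characters </div>, so a search for "&lt;" finds nothing. As far as I know, a plain string passed to find(text=...) is also compared against the whole text node, so a regular expression or a function is usually needed to match only part of it.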
















python html beautifulsoup






asked Nov 16 '18 at 2:30

Chris













3 Answers
Using str.replace



In [3]: print(html.replace('&lt;', '<').replace('&gt;', '>'))

<html>
<body>
<div>
</div>
<div class="first_class">
<h1 id="Header_1">
Header_1
</h1>
</div>
</body>
</html>


To load the fixed markup into BeautifulSoup from a file, open the file first, replace the malformed text, and then pass the contents to BeautifulSoup. Something like this:



import bs4

with open('malformed.html') as f:
    malformed = f.read()

html = malformed.replace('&lt;', '<').replace('&gt;', '>')

# Name the parser explicitly; 'html.parser' ships with Python
soup = bs4.BeautifulSoup(html, 'html.parser')
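
The read, replace, then parse order described above can be tried end to end with a throwaway file. This is only a sketch: the helper name fix_escaped_brackets, the sample markup, and the use of a temporary file are assumptions for illustration, and the BeautifulSoup step is left out so that it runs with the standard library alone.

```python
import os
import tempfile


def fix_escaped_brackets(text):
    """Hypothetical helper: turn &lt; and &gt; back into angle brackets."""
    return text.replace("&lt;", "<").replace("&gt;", ">")


sample = '<div>\n&lt;/div&gt;\n<div class="first_class">\n</div>\n'

# Write the sample to a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Read as text (str), not bytes, so str.replace works directly.
    with open(path, encoding="utf-8") as f:
        fixed = fix_escaped_brackets(f.read())
finally:
    os.remove(path)

print(fixed)
```

The cleaned string in fixed is what would then be handed to BeautifulSoup, so the parser never sees the escaped brackets.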





  • @aydow that works on the self-contained example; however, when I load the HTML from a file into BeautifulSoup first and then try to replace, I get the error 'NoneType' object is not callable. Do you know how to work around that?

    – Chris
    Nov 17 '18 at 0:54











  • @Chris see updated answer

    – aydow
    Nov 17 '18 at 23:01



















I think you need a function to decode them, such as unescape from the html module. (Older code used HTMLParser().unescape, which was removed in Python 3.9.)



from html import unescape

html = """
<html>
<body>
<div>
&lt;/div&gt; <----- how to convert the line into </div>
<div class="first_class">
<h1 id="Header_1">
Header_1
</h1>
</div>
</body>
</html>
"""

print(unescape(html))


Output



<html>
<body>
<div>
</div> <----- how to convert the line into </div>
<div class="first_class">
<h1 id="Header_1">
Header_1
</h1>
</div>
</body>
</html>
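
One thing to keep in mind with this approach is that unescape decodes every entity it finds, not only the angle brackets. The comparison below is a small sketch with a made-up sample string; whether the difference matters depends on what else the document contains.

```python
from html import unescape

# Made-up sample: escaped brackets plus an ampersand and a doubly escaped tag.
text = "&lt;/div&gt; Tom &amp; Jerry &amp;lt;b&amp;gt;"

# Decodes all entities in a single pass, including &amp;.
decoded_all = unescape(text)

# Touches only the two bracket entities and leaves &amp; alone.
decoded_brackets = text.replace("&lt;", "<").replace("&gt;", ">")

print(decoded_all)
print(decoded_brackets)
```

If the file contains legitimately escaped ampersands that must stay escaped, the targeted replacement is probably the safer of the two.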





  • @kcorlidy, the logic doesn't work when the HTML is first parsed into BeautifulSoup, like html=(BeautifulSoup(open('C:\FolderTest.html'), 'html.parser')). Error: a bytes-like object is required, not 'str'

    – Chris
    Nov 17 '18 at 0:01






    I ran html=(BeautifulSoup(open('C:\FolderTest.html'), 'html.parser')) but did not get that error. If you want to read the file as bytes, use open('C:\FolderTest.html','rb'). By the way, you must close the file when you have finished reading. Use with open('Test.html',"rb") as fd: html = BeautifulSoup(fd.read(), 'html.parser')

    – kcorlidy
    Nov 17 '18 at 2:07
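
One plausible source of the "a bytes-like object is required, not 'str'" message is calling replace with str arguments on data that was read in binary mode. This is an assumption, since the failing code is not shown in the thread. The sketch below reproduces the message with a made-up bytes value and shows two ways around it:

```python
# Stand-in for what open(path, "rb").read() would return.
raw = b"<div>&lt;/div&gt;</div>"

try:
    raw.replace("&lt;", "<")  # str arguments on a bytes object
except TypeError as exc:
    error_message = str(exc)

# Option 1: decode to str first, then use str patterns.
as_text = raw.decode("utf-8").replace("&lt;", "<").replace("&gt;", ">")

# Option 2: stay in bytes and use bytes patterns.
as_bytes = raw.replace(b"&lt;", b"<").replace(b"&gt;", b">")

print(error_message)
print(as_text)
```

Opening the file in text mode, as in the str.replace answer, avoids the problem because read() then returns a str.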





















Try using regular expressions instead.



Something like:



html = re.sub("&lt;", "<", html)


for less-than and



html = re.sub("&gt;", ">", html)


for greater-than.



Make sure you import re first.



Edit: for reference on how to use re.sub - https://lzone.de/examples/Python%20re.sub



Edit2: After some further research it seems like str.replace() is faster, so you may want to use that instead.
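
If regular expressions are preferred anyway, both entities can be handled in one pass rather than with two separate re.sub calls. The names ENTITY_TO_CHAR, BRACKET_ENTITY, and restore_brackets are made up for this sketch:

```python
import re

# Map each escaped entity to the character it stands for.
ENTITY_TO_CHAR = {"&lt;": "<", "&gt;": ">"}

# Matches either &lt; or &gt; and nothing else.
BRACKET_ENTITY = re.compile(r"&(?:lt|gt);")


def restore_brackets(text):
    """Replace &lt; and &gt; with literal angle brackets in a single pass."""
    return BRACKET_ENTITY.sub(lambda m: ENTITY_TO_CHAR[m.group(0)], text)


html = "<div>\n&lt;/div&gt;\n</div>"
restored = restore_brackets(html)
print(restored)
```

For two fixed strings this offers no real advantage over chained str.replace calls; it mainly helps if the list of entities to restore grows.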






  • @jwoff, this doesn't work when the HTML is loaded from the file into BeautifulSoup first, some operations are done, and the replacement is attempted at the end. I tried converting the BeautifulSoup object into a string with str(html), replacing, and converting back to BeautifulSoup for nicely formatted output, but there were some small unexpected changes to the structure of the code.

    – Chris
    Nov 17 '18 at 1:01













answered Nov 16 '18 at 3:14, edited Nov 17 '18 at 23:01

aydow (str.replace answer)













answered Nov 16 '18 at 5:41

kcorlidy (unescape answer)













answered Nov 16 '18 at 2:51, edited Nov 16 '18 at 6:06

jwoff (regular expressions answer)












