Question:
Hi, I came across incomplete HTML documents in which the "html" and "body" tags are missing.
Here's the code I implemented:
import bs4
content='''
<head>
<title>
my page
</title>
</head>
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td>
<p>
<img alt="Brastra.gif (4376 bytes)" height="82" src="../../DNN/Brastra.gif"/>
</p>
</td>
<td>
<p>
<strong>
Titulo 1
<br/>
Titulo 2
<br/>
Titulo 3
</strong>
</p>
</td>
</tr>
</table>
<small>
<strong>
<a href="http://example.com/">
Link.
</a>
</strong>
</small>
<p>
<a href="http://example.com/">I linked to <i>example.com</i></a>
</p>
<p>#1</p>
<p>#2</p>
'''
soup = bs4.BeautifulSoup(content, 'html.parser')
I tried the snippet below, which raises an error:
tag = soup.new_tag('html')
tag.wrap(soup)
ValueError: Cannot replace one element with another when the element to
be replaced is not part of a tree.
I also tried this other approach, which scrambles the order of the tags:
for item in soup.find_all():
    tag.append(item.extract())
soup = tag
<body>
<head>
</head>
<title>
my page
</title>
<div>
</div>
<center>
</center>
<table border="0" cellpadding="0" cellspacing="0">
</table>
<tr>
</tr>
<td>
</td>
How can I use bs4 to wrap this markup in 'html' and 'body' tags?
Answer:
For this you will need the html5lib parser, which builds the tree the way a browser would and fills in the missing "html" and "body" tags.
pip install html5lib
I tried it on my console and this was the result:
In [2]: import bs4
In [3]: content = '''
<head>
<title>
my page
</title>
</head>
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td>
<p>
<img alt="Brastra.gif (4376 bytes)" height="82" src="../../DNN/Brastra.gif"/>
</p>
</td>
<td>
<p>
<strong>
Titulo 1
<br/>
Titulo 2
<br/>
Titulo 3
</strong>
</p>
</td>
</tr>
</table>
<small>
<strong>
<a href="http://example.com/">
Link.
</a>
</strong>
</small>
<p>
<a href="http://example.com/">I linked to <i>example.com</i></a>
</p>
<p>#1</p>
<p>#2</p>
'''
In [4]: soup = bs4.BeautifulSoup(content, 'html5lib')
In [5]: soup
Out[5]:
<html><head>
<title>
my page
</title>
</head>
<body><table border="0" cellpadding="0" cellspacing="0">
<tbody><tr>
<td>
<p>
<img alt="Brastra.gif (4376 bytes)" height="82" src="../../DNN/Brastra.gif"/>
</p>
</td>
<td>
<p>
<strong>
Titulo 1
<br/>
Titulo 2
<br/>
Titulo 3
</strong>
</p>
</td>
</tr>
</tbody></table>
<small>
<strong>
<a href="http://example.com/">
Link.
</a>
</strong>
</small>
<p>
<a href="http://example.com/">I linked to <i>example.com</i></a>
</p>
<p>#1</p>
<p>#2</p>
</body></html>
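If installing html5lib is not an option, the wrapping can also be done by hand with 'html.parser'. The two attempts in the question fail for different reasons, as far as I can tell: wrap() puts the element it is called on inside its argument, so tag.wrap(soup) tries to move a freshly created tag that has no parent yet, and find_all() returns every nested tag, so extracting each one flattens the tree. Below is a minimal sketch that moves only the top-level nodes instead. The helper name wrap_in_html_body is made up for this example, and unlike html5lib it does not add "tbody" or repair anything else.

```python
import bs4


def wrap_in_html_body(soup):
    """Wrap the top-level nodes of `soup` in <html> and <body>, in place.

    Hypothetical helper for illustration: a top-level <head> is kept
    directly under <html>, everything else is moved into <body>.
    """
    html = soup.new_tag("html")
    body = soup.new_tag("body")

    # Copy the list first, because extract() mutates soup.contents.
    for node in list(soup.contents):
        node.extract()
        if isinstance(node, bs4.Tag) and node.name == "head":
            html.append(node)
        else:
            body.append(node)

    html.append(body)
    soup.append(html)
    return soup


content = "<head><title>my page</title></head><p>#1</p><p>#2</p>"
soup = bs4.BeautifulSoup(content, "html.parser")
wrap_in_html_body(soup)
print(soup)
# <html><head><title>my page</title></head><body><p>#1</p><p>#2</p></body></html>
```

Because only soup.contents is walked (not find_all()), nested tags such as the "tr" and "td" inside the table stay where they are.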