python - BeautifulSoup sibling structure and br tags

Question

I am trying to parse an HTML document with the BeautifulSoup Python library, but the structure is being distorted by <br> tags. Let me give an example.

Input HTML:

<div>
  some text <br>
  <span> some more text </span> <br>
  <span> and more text </span>
</div>

HTML as interpreted by BeautifulSoup:

<div>
  some text
  <br>
    <span> some more text </span>
    <br>
      <span> and more text </span>
    </br>
  </br>
</div>

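In the source, the spans can be considered siblings. After parsing (with the default parser), the spans are suddenly no longer siblings, because the br tags have become part of the structure.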
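Whether the nesting above actually occurs depends on the BeautifulSoup version and on which parser is in use, so it can help to check directly. The following is a minimal sketch, not part of the original question: the helper name `spans_are_siblings` is hypothetical, and it assumes `bs4` is installed and uses the standard-library `html.parser` backend.

```python
from bs4 import BeautifulSoup

HTML = """<div>
  some text <br>
  <span> some more text </span> <br>
  <span> and more text </span>
</div>"""


def spans_are_siblings(markup, parser="html.parser"):
    """Return True if every <span> in the markup has the same parent tag.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(markup, parser)
    spans = soup.find_all("span")
    # Siblings share one parent object; if <br> swallowed a span,
    # that span's parent is the <br> tag rather than the <div>.
    return len({id(span.parent) for span in spans}) == 1


print(spans_are_siblings(HTML))
```

Passing a different parser name (for example "lxml", if it is installed) lets you compare how each backend builds the tree for the same input.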

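The solution I can think of is to strip the <br> tags entirely before feeding the HTML to BeautifulSoup, but that seems inelegant because it requires me to change the input. Is there a better way to solve this?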

score 10 · Accepted Answer

Your best option is to extract() the line breaks. It's easier than you think :).

>>> from bs4 import BeautifulSoup as BS
>>> html = """<div>
...   some text <br>
...   <span> some more text </span> <br>
...   <span> and more text </span>
... </div>"""
>>> soup = BS(html)
>>> for linebreak in soup.find_all('br'):
...     linebreak.extract()
... 
<br/>
<br/>
>>> print(soup.prettify())
<html>
 <body>
  <div>
   some text
   <span>
    some more text
   </span>
   <span>
    and more text
   </span>
  </div>
 </body>
</html>
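The same idea can be written as a script rather than an interactive session. This is a sketch under a few assumptions: the function name `strip_line_breaks` is made up for illustration, the parser is passed explicitly (here the standard-library "html.parser"), and `decompose()` is used in place of `extract()` because the removed tags are not needed afterwards.

```python
from bs4 import BeautifulSoup

HTML = """<div>
  some text <br>
  <span> some more text </span> <br>
  <span> and more text </span>
</div>"""


def strip_line_breaks(markup, parser="html.parser"):
    """Parse markup and remove every <br> tag from the resulting tree.

    Hypothetical helper; the parser choice is an assumption.
    """
    soup = BeautifulSoup(markup, parser)
    # find_all() returns a list, so removing tags while looping is safe.
    for br in soup.find_all("br"):
        br.decompose()  # like extract(), but discards the tag
    return soup


soup = strip_line_breaks(HTML)
first_span = soup.find("span")
# With the <br> tags gone, the second span is reachable as a sibling.
second_span = first_span.find_next_sibling("span")
print(second_span.get_text(strip=True))
```

Use `extract()` instead of `decompose()` if you want to keep the removed tags around, for example to reinsert them later.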

score 6 · Accepted Answer

You can also do this:

str(soup).replace("</br>", "")
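A string-based variant can also be applied to the raw markup before it is parsed, which covers both opening and closing forms of the tag. The sketch below is an illustration, not the answerer's code: the names `BR_TAG` and `parse_without_br` are hypothetical, and the regular expression only matches attribute-free tags such as <br>, <br/>, <br /> and </br>.

```python
import re

from bs4 import BeautifulSoup

# Matches <br>, <br/>, <br /> and </br> in any letter case.
# Assumption: the tags carry no attributes (e.g. <br class="x"> is not matched).
BR_TAG = re.compile(r"</?br\s*/?>", re.IGNORECASE)


def parse_without_br(markup, parser="html.parser"):
    """Strip br tags from the raw markup, then parse what is left."""
    cleaned = BR_TAG.sub("", markup)
    return BeautifulSoup(cleaned, parser)


sample = "<div>text<br><span>one</span><BR /><span>two</span></br></div>"
soup = parse_without_br(sample)
print([span.get_text() for span in soup.find_all("span")])
```

Note that this changes the input text, which is the trade-off the question mentions.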

answered 2014-06-27T17:07:44.087

score 6 · Accepted Answer

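This is a very old question, but I ran into a similar problem because my document contained closing </br> tags. Because of them, BeautifulSoup simply ignored a large part of the document (I assume bs was trying to handle the closing tags). soup.find_all('br') did not actually find anything, because there were no opening br tags, so I could not use the extract() method.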

After banging my head against it for an hour, I found that using the lxml parser instead of the default HTML parser solved the problem.

soup = BeautifulSoup(page, 'lxml')
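Since lxml is a separate install, a script may want to fall back to another parser when it is missing. The following is a rough sketch of that idea, assuming `bs4` raises `FeatureNotFound` when the requested parser is unavailable; the helper name `make_soup` and the sample `page` string are made up for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml", fallback="html.parser"):
    """Parse with the preferred parser, falling back if it is not installed.

    Hypothetical helper; the parser names are assumptions about the setup.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser library is not available.
        return BeautifulSoup(markup, fallback)


# Sample input with a stray closing </br> tag, as described above.
page = "<div><span>first</span></br><span>second</span></div>"
soup = make_soup(page)
print([span.get_text() for span in soup.find_all("span")])
```

Different parsers can build different trees from the same malformed markup, so it is worth checking the result on your own documents after switching.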
