python - How do I fix wrongly nested / unclosed HTML tags?

Question

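I need to sanitize user-submitted HTML by closing any open tags in the correct nesting order. I have been looking for an algorithm or Python code to do this, but haven't found anything except some half-baked implementations in PHP and the like.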

For example, something like

<p>
  <ul>
    <li>Foo

becomes

<p>
  <ul>
    <li>Foo</li>
  </ul>
</p>

Any help would be appreciated :)

score 32 · Accepted Answer

Using BeautifulSoup:

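# Python 3 with bs4; the original answer used BeautifulSoup 3 on Python 2
from bs4 import BeautifulSoup
html = "<p><ul><li>Foo"
# "html.parser" is the stdlib-backed parser; other parsers may nest the tags differently
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())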

gives you

<p>
 <ul>
  <li>
   Foo
  </li>
 </ul>
</p>

As far as I know, you can't stop prettify() from putting the <li> and </li> tags on separate lines from Foo.
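If the extra line breaks are a problem, one option is to skip prettify() and serialize the tree directly. This is only a sketch: it assumes the bs4 package with the built-in "html.parser" backend (not the BeautifulSoup 3 module), and `close_tags_compact` is a made-up helper name.

```python
def close_tags_compact(html):
    """Return the repaired markup without the whitespace prettify() adds."""
    # Third-party dependency (pip install beautifulsoup4), assumed to be present.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    # str() serializes the tree as-is: no added newlines or indentation.
    return str(soup)


if __name__ == "__main__":
    print(close_tags_compact("<p><ul><li>Foo"))
```

The closing tags are still added by the parser; only the pretty-printing step is left out.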

Using Tidy:

import tidy
html = "<p><ul><li>Foo"
print(tidy.parseString(html, show_body_only=True))

gives you

<ul>
<li>Foo</li>
</ul>

Unfortunately, I know of no way to keep the <p> tag in the example. Tidy interprets it as an empty paragraph rather than an unclosed one, so doing

print(tidy.parseString(html, show_body_only=True, drop_empty_paras=False))

comes out as

<p></p>
<ul>
<li>Foo</li>
</ul>

Ultimately, of course, the <p> tag in your example is redundant, so you might be fine with losing it.

Finally, Tidy can also do indentation:

print(tidy.parseString(html, show_body_only=True, indent=True))

becomes

<ul>
  <li>Foo
  </li>
</ul>

All of these have their ups and downs, but hopefully one of them is close enough.

score 10 · Accepted Answer

Run it through Tidy or one of its ported libraries.

Try to code it by hand and you will want to gouge your eyes out.
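To give a feel for what hand-coding involves, here is a naive stack-based sketch built on the standard library's html.parser. The names `TagCloser` and `close_open_tags` are invented for illustration. It only balances tags: it knows nothing about HTML's implicit-close rules (for example, that <ul> cannot sit inside <p>), and it drops doctype declarations and stray closing tags, which is where the real pain starts.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagCloser(HTMLParser):
    """Re-emit the input while tracking which tags are still open."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []        # pieces of the repaired document
        self.open_tags = []  # stack of currently open tag names

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: emit as-is, nothing to track.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag with no matching opener: drop it
        # Close everything opened after the matching tag, innermost first.
        while self.open_tags:
            current = self.open_tags.pop()
            self.out.append(f"</{current}>")
            if current == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def close_open_tags(html):
    parser = TagCloser()
    parser.feed(html)
    parser.close()
    # Whatever is still open gets closed in reverse (innermost-first) order.
    while parser.open_tags:
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)


print(close_open_tags("<p><ul><li>Foo"))  # <p><ul><li>Foo</li></ul></p>
```

For anything beyond simple cases like this one, a real parser (Tidy, html5lib) is the safer route.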

answered 2008-11-16T04:17:52.173

score 7 · Accepted Answer

Use html5lib, it works great! Like this:

soup = BeautifulSoup(data, 'html5lib')
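One thing to be aware of: html5lib builds a complete document, so the result is wrapped in <html>, <head> and <body> elements. If only the repaired fragment is wanted, something like the following may work. It is a sketch that assumes both bs4 and html5lib are installed, and `repair_fragment` is a hypothetical helper name.

```python
def repair_fragment(html):
    """Parse with html5lib and return only what ends up inside <body>."""
    # Third-party dependencies: beautifulsoup4 and html5lib (assumed installed).
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html5lib")
    # Serialize the children of <body>, leaving out the wrapper elements
    # that html5lib adds around a fragment.
    return soup.body.decode_contents()


if __name__ == "__main__":
    print(repair_fragment("<p><ul><li>Foo"))
```

Note that html5lib follows the browser parsing rules, so an unclosed <p> directly followed by <ul> is closed before the list rather than wrapped around it.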

answered 2017-08-23T07:08:12.593

score 1 · Accepted Answer

Just now I got some HTML that lxml and pyquery couldn't handle properly; it seems the HTML has some errors in it. Since Tidy is not easy to install on Windows, I chose BeautifulSoup. But I found that:

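from bs4 import BeautifulSoup
import lxml.html
# page holds the raw HTML string
soup = BeautifulSoup(page)
# lxml.html is a module, so parse through lxml.html.fromstring()
h = lxml.html.fromstring(soup.prettify())

behaves the same as h = lxml.html.fromstring(page)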

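What really solved my problem was soup = BeautifulSoup(page, 'html5lib').
You need to install html5lib first; then you can use it as a parser for BeautifulSoup. The html5lib parser seems to work much better than the others.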

Hope this helps someone.
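Since html5lib has to be installed separately, code that may run without it can fall back to the built-in parser. A minimal sketch, assuming bs4 is available; `make_soup` is an invented helper, and the fallback parser may repair the markup differently from html5lib.

```python
def make_soup(markup, preferred="html5lib"):
    """Build a soup with the preferred parser, else fall back to html.parser."""
    # Third-party dependency: beautifulsoup4 (assumed installed).
    from bs4 import BeautifulSoup, FeatureNotFound

    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised by bs4 when the requested parser library is not installed.
        return BeautifulSoup(markup, "html.parser")


if __name__ == "__main__":
    print(make_soup("<p><ul><li>Foo").prettify())
```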

score 1 · Accepted Answer

I tried the following, but it failed on Python 3:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(page, 'html5lib')

I tried the following instead, and it worked:

import bs4
soup = bs4.BeautifulSoup(html, 'html5lib')
f_html = soup.prettify()
print(f'Formatted html::: {f_html}')
