python - HTML parsing with Tidy in Python

Question

This code takes some bad HTML, cleans it up with the Tidy library, and then passes the result to HtmlLib.Reader().

import tidy
options = dict(output_xhtml=1, 
                add_xml_decl=1, 
                indent=1, 
                tidy_mark=0)

from xml.dom.ext.reader import HtmlLib
reader = HtmlLib.Reader()

doc = reader.fromString(tidy.parseString("<Html>Bad Html.", **options))

I don't seem to be passing the right type to fromString, judging by this traceback:

Traceback (most recent call last):
  File "getComicEmbed.py", line 33, in <module>
    doc = reader.fromString(tidy.parseString("<Html>Bad Html.</b>", **options))
  File "C:\Python26\lib\site-packages\_xmlplus\dom\ext\reader\HtmlLib.py", line 67, in fromString
    stream = reader.StrStream(str)
  File "C:\Python26\lib\site-packages\_xmlplus\dom\ext\reader\__init__.py", line 24, in StrStream
    return cStringIO.StringIO(st)
TypeError: expected read buffer, _Document found

What should I do? Thanks!

score 4 · Accepted Answer

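tidy's parseString function returns a _Document instance, which implements __str__ but not the buffer interface. HtmlLib.Reader().fromString therefore cannot create a StringIO object from it.

The fix should be fairly simple. Change: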

doc = reader.fromString(tidy.parseString("<Html>Bad Html.", **options))

to

doc = reader.fromString(str(tidy.parseString("<Html>Bad Html.", **options)))
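To illustrate why the explicit str() call matters, here is a minimal sketch that does not use tidy or HtmlLib at all. FakeDocument and to_stream are hypothetical stand-ins for tidy's _Document and for the reader's StrStream helper, and the sketch is written for Python 3, where io.StringIO plays roughly the role that cStringIO.StringIO plays in the traceback. Treat it as an analogy for the failure, not a reproduction of either library's internals.

```python
import io


class FakeDocument:
    """Hypothetical stand-in for tidy's _Document: it defines __str__ and nothing else."""

    def __init__(self, markup):
        self._markup = markup

    def __str__(self):
        return self._markup


def to_stream(source):
    # Hypothetical stand-in for reader.StrStream.
    # io.StringIO only accepts a real str (or None) as its initial value.
    return io.StringIO(source)


doc = FakeDocument("<html><body>Bad Html.</body></html>")

try:
    to_stream(doc)  # passing the object itself raises TypeError
    failed = False
except TypeError:
    failed = True

stream = to_stream(str(doc))  # converting to str first works
content = stream.read()
```

Defining __str__ only tells Python how to convert the object when asked; it does not make the object acceptable wherever a string is required, so the conversion has to be requested explicitly.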

score 1 · Accepted Answer

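I haven't used the Python tidy module and don't know where to find it, but it looks like you need to call something like toString on the result of tidy.parseString to convert the parsed document back to XHTML.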

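For a different approach, you might consider lxml.html, which parses broken markup well and gives you a nice ElementTree API for working with the result. It can also pretty-print *ML, which makes it something of a superset of tidy, though it may not have exactly the same ability to work through incoherent markup.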
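A rough sketch of the lxml.html route might look like the following. It assumes the third-party lxml package is installed; the variable names are illustrative only, and the input is the broken snippet from the question's traceback.

```python
import lxml.html

broken = "<Html>Bad Html.</b>"

# fromstring() tolerates malformed markup and returns an element tree.
root = lxml.html.fromstring(broken)

# The result supports the ElementTree-style API, plus HTML helpers
# such as text_content().
text = root.text_content()

# Serialize the repaired tree back to a pretty-printed string.
pretty = lxml.html.tostring(root, pretty_print=True, encoding="unicode")
print(pretty)
```

Because the result is already a parsed tree, this approach avoids the intermediate step of handing a cleaned-up string to a second parser.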

Also: lxml is written in C (in fact, like the Python tidy module, it just wraps a C library), so it is much faster than some of the other Python modules for working with XML.
