python - Validating an HTML fragment using html5lib

Question

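I am using Python and html5lib to check whether some HTML code entered in a form field is valid.

I tried the following code to test a valid fragment, but I get an error that is unexpected (at least to me):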

>>> import html5lib
>>> from html5lib.filters import lint
>>> fragment = html5lib.parseFragment('<p><script>alert("Boo!")</script></p>')
>>> walker = html5lib.getTreeWalker('etree')
>>> [i for i in lint.Filter(walker(fragment))]
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/xyz/html5lib-1.0b3-py2.7.egg/html5lib/filters/lint.py", line 28, in __iter__
    raise LintError(_("Tag name is not a string: %(tag)r") % {"tag": name})
LintError: Tag name is not a string: u'p'

What am I doing wrong?

My default encoding is utf-8:

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'

score 2 · Accepted Answer

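The lint filter does not try to validate HTML (yes, documentation is badly needed; that is a large part of the reason there is no 1.0 release yet). It only checks that the treewalker API is adhered to. Except that it does not even do that correctly at the moment, because it is broken due to issue #172.

html5lib does not try to provide any validator, because implementing an HTML validator is a lot of work.

I am not aware of any reasonably complete validator other than Validator.nu, although that one is written in Java. It does, however, provide a Web API that may suit your purpose.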
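A minimal sketch of calling such a Web API from Python, using only the standard library. The endpoint URL, the `out=json` parameter, and the `messages` / `type` / `message` fields of the response are assumptions based on Validator.nu's published web service interface and should be checked against its current documentation. The helper names and the document wrapper are hypothetical; a fragment is wrapped in a minimal page here on the assumption that the service expects a complete document.

```python
import json
import urllib.request

# Assumed endpoint; verify against the Validator.nu web service documentation.
VALIDATOR_URL = "https://validator.nu/?out=json"

# Hypothetical minimal page used to turn a fragment into a full document.
DOCUMENT_TEMPLATE = (
    '<!DOCTYPE html><html><head><meta charset="utf-8">'
    "<title>fragment</title></head><body>{fragment}</body></html>"
)


def build_request(fragment):
    """Build (but do not send) a POST request carrying the wrapped fragment."""
    document = DOCUMENT_TEMPLATE.format(fragment=fragment)
    return urllib.request.Request(
        VALIDATOR_URL,
        data=document.encode("utf-8"),
        headers={
            "Content-Type": "text/html; charset=utf-8",
            "User-Agent": "fragment-check-example",  # illustrative value
        },
    )


def extract_errors(response_text):
    """Return the text of every message the service marked as an error."""
    messages = json.loads(response_text).get("messages", [])
    return [m.get("message", "") for m in messages if m.get("type") == "error"]


def validate_fragment(fragment, timeout=10):
    """Send the fragment to the service and return a list of error messages."""
    with urllib.request.urlopen(build_request(fragment), timeout=timeout) as resp:
        return extract_errors(resp.read().decode("utf-8"))
```

Calling `validate_fragment('<p><script>alert("Boo!")</script></p>')` needs network access. Errors reported about the wrapper page itself, if any, would have to be told apart from errors in the fragment.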

score 1 · Accepted Answer

The "strict" parsing mode can be used to detect errors:

>>> import html5lib
>>> html5parser = html5lib.HTMLParser(strict=True)
>>> html5parser.parseFragment('<p>Lorem <a href="/foobar">ipsum</a>')
<Element 'DOCUMENT_FRAGMENT' at 0x7f1d4a58fd60>
>>> html5parser.parseFragment('<p>Lorem </a>ipsum<a href="/foobar">')
Traceback (most recent call last):
  ...
html5lib.html5parser.ParseError: Unexpected end tag (a). Ignored.
>>> html5parser.parseFragment('<p><form></form></p>')
Traceback (most recent call last):
  ...
html5lib.html5parser.ParseError: Unexpected end tag (p). Ignored.
>>> html5parser.parseFragment('<option value="example" />')
Traceback (most recent call last):
  ...
html5lib.html5parser.ParseError: Trailing solidus not allowed on element option
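For checking a form field, the strict parser can be wrapped in a small helper that reports the first parse error instead of raising. This is only a sketch: `check_fragment` is a hypothetical name, and a fragment that passes has merely parsed without parse errors, which is weaker than full HTML validation.

```python
import html5lib
from html5lib.html5parser import ParseError


def check_fragment(fragment):
    """Return (True, None) if the fragment parses cleanly in strict mode,
    otherwise (False, message) with the first parse error reported."""
    # strict=True makes the parser raise ParseError at the first parse error
    # instead of recovering silently.
    parser = html5lib.HTMLParser(strict=True)
    try:
        parser.parseFragment(fragment)
    except ParseError as exc:
        return False, str(exc)
    return True, None


if __name__ == "__main__":
    print(check_fragment('<p>Lorem <a href="/foobar">ipsum</a>'))
    print(check_fragment('<p>Lorem </a>ipsum<a href="/foobar">'))
```

Because parsing stops at the first error, a fragment with several problems has to be corrected and rechecked one error at a time.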
