I am trying to parse broken HTML with the lxml parser on Python 2.5 and 2.7.
Contrary to the lxml documentation (http://lxml.de/parsing.html#parsing-html), parsing broken HTML does not work:
from lxml import etree
import StringIO
broken_html = "<html><head><title>test<body><h1>page title</h3>"
parser = etree.HTMLParser()
tree = etree.parse(StringIO.StringIO(broken_html))
Result:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 2954, in lxml.etree.parse (src/lxml/lxml.etree.c:56220)
  File "parser.pxi", line 1550, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:82482)
  File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82764)
  File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81562)
  File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78232)
  File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74488)
  File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75379)
  File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74712)
lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: h1 line 1 and h3, line 1, column 50
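
For comparison, here is a minimal sketch of the pattern the linked documentation page describes. It assumes the relevant difference is that the snippet above creates `parser` but never hands it to `etree.parse`, in which case lxml falls back to its default XML parser, which would explain the `XMLSyntaxError`. The sketch is written for Python 3; on Python 2 the `StringIO` import would come from the `StringIO` module instead, as in the code above.

```python
from io import StringIO  # Python 3; on Python 2 use `from StringIO import StringIO`

from lxml import etree

broken_html = "<html><head><title>test<body><h1>page title</h3>"

# HTMLParser is the recovering parser intended for malformed HTML.
parser = etree.HTMLParser()

# Pass the parser explicitly as the second argument. Without it,
# etree.parse() uses the default (strict) XML parser.
tree = etree.parse(StringIO(broken_html), parser)

# Serialize the repaired tree to inspect what the parser made of the input.
result = etree.tostring(tree.getroot(), pretty_print=True, method="html")
print(result.decode("ascii"))
```

If that assumption holds, the call should return a tree with the unclosed and mismatched tags repaired instead of raising.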