python - Parsing an HTML body fragment in lxml

Question

I am trying to parse a fragment of HTML:

<body><h1>title</h1><img src=""></body>

I am using lxml.html.fromstring. It is driving me crazy because it keeps stripping the <body> tag from my fragments:

 > lxml.html.fromstring('<html><h1>a</h1></html>').tag
 'html'
 > lxml.html.fromstring('<div><h1>a</h1></div>').tag
 'div'
 > lxml.html.fromstring('<body><h1>a</h1></body>').tag
 'h1'

I have also tried document_fromstring, fragment_fromstring, clean_html with page_structure=False, and so on. Nothing works.

I need to use lxml because I am passing the HTML fragment to PyQuery.

I just want lxml to leave my HTML fragment alone. Is that possible?

score 9 · Accepted Answer

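.fragment_fromstring() removes the <html> tag as well; basically, whenever you do not have a full HTML document (with an <html> top-level element and/or a doctype), .fromstring() falls back to .fragment_fromstring(), and that method always removes both the <html> and the <body> tags.

The workaround is to tell .fragment_fromstring() to give you a <body> parent tag: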

>>> lxml.html.fragment_fromstring('<body><h1>a</h1></body>', create_parent='body')
<Element body at 0x10d06fbf0>

This will not preserve any attributes on the original <body> tag.
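To make the attribute loss concrete, here is a minimal sketch that parses a fragment whose <body> carries an attribute. The class="foo" value is made up for illustration; the point is that create_parent appears to build a fresh parent element rather than reuse the parsed one, so nothing from the original tag carries over.

```python
import lxml.html

# Ask lxml to wrap the fragment's children in a newly created <body>.
body = lxml.html.fragment_fromstring(
    '<body class="foo"><h1>a</h1></body>', create_parent='body')

print(body.tag)            # tag of the newly created parent
print(dict(body.attrib))   # {} : the class attribute is gone
print(body.find('.//h1') is not None)  # the <h1> content is still there
```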

Another workaround is to use the .document_fromstring() method, which wraps your document in an <html> tag that you can then strip off again:

>>> lxml.html.document_fromstring('<body><h1>a</h1></body>')[0]
<Element body at 0x10d06fcb0>

This does preserve the attributes on <body>:

>>> lxml.html.document_fromstring('<body class="foo"><h1>a</h1></body>')[0].attrib
{'class': 'foo'}

Using the .document_fromstring() function on your first example gives:

>>> body = lxml.html.document_fromstring('<body><h1>title</h1><img src=""></body>')[0]
>>> lxml.html.tostring(body)
'<body><h1>title</h1><img src=""></body>'
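The indexing trick above can be wrapped in a small helper. This is only a sketch: parse_body is a hypothetical name, and it looks the element up with find('body') instead of [0], on the assumption that a fragment containing head-level content might make the parser emit a <head> before the <body>.

```python
import lxml.html


def parse_body(fragment):
    """Parse an HTML fragment and return its <body> element (hypothetical helper)."""
    doc = lxml.html.document_fromstring(fragment)
    # Look <body> up by name rather than by position, in case the
    # parser also produced a <head> element.
    return doc.find('body')


body = parse_body('<body class="foo"><h1>title</h1><img src=""></body>')
print(body.tag, dict(body.attrib))
print([child.tag for child in body])
# encoding='unicode' makes tostring() return str instead of bytes.
print(lxml.html.tostring(body, encoding='unicode'))
```

The resulting element can then be passed on to other code in place of the value returned by lxml.html.fromstring().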

If you only want to do this when there is no <html> tag, do what lxml.html.fromstring() does and test for a full document:

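import lxml.html

# Pick the matcher by input type; both helpers are private to lxml.html.
htmltest = lxml.html._looks_like_full_html_bytes if isinstance(inputtext, bytes) else lxml.html._looks_like_full_html_unicode
if htmltest(inputtext):
    tree = lxml.html.fromstring(inputtext)  # full document: keep as-is
else:
    tree = lxml.html.document_fromstring(inputtext)[0]  # fragment: return <body>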
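Because the underscore-prefixed helpers are private and could change between lxml releases, the same dispatch can be sketched with a regular expression of your own. The pattern below is an assumption about what "looks like a full document" should mean (optional leading whitespace, then <html or <!doctype, case-insensitive), and looks_like_full_document and parse_keeping_body are hypothetical names.

```python
import re

import lxml.html

# Assumed definition of a "full document": optional whitespace followed
# by <html or <!doctype. One pattern per input type (str and bytes).
_FULL_DOC_STR = re.compile(r'^\s*<(?:html|!doctype)', re.I)
_FULL_DOC_BYTES = re.compile(br'^\s*<(?:html|!doctype)', re.I)


def looks_like_full_document(text):
    """Return True if text starts like a complete HTML document."""
    pattern = _FULL_DOC_BYTES if isinstance(text, bytes) else _FULL_DOC_STR
    return pattern.match(text) is not None


def parse_keeping_body(text):
    """Parse text, returning <html> for full documents and <body> otherwise."""
    if looks_like_full_document(text):
        return lxml.html.fromstring(text)
    return lxml.html.document_fromstring(text).find('body')


print(parse_keeping_body('<body><h1>a</h1></body>').tag)
print(parse_keeping_body('<html><body><h1>a</h1></body></html>').tag)
```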
