python - Preserving some HTML tags when parsing with lxml XPath

Question

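I am writing an HTML parser using lxml's XPath support. It seems to work fine, but I have one major problem.

When I parse the HTML <p> tags, some of the words inside them are wrapped in inline tags such as <b> and <i>. I need to keep those tags.

For example, when parsing this HTML: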

<div class="ArticleDetail">
    <p>Hello world, this is a <b>simple</b> test, which contains words in <i>italic</i> and others.
    I have a <strong>strong</strong> tag here. I guess this is a silly test.
    <br/>
    Ops, line breaks.
    <br/></p>
</div>

If I run this Python code:

import lxml.html

x = lxml.html.fromstring("...html text...").xpath("//div[@class='ArticleDetail']/p")
for stuff in x:
    # text_content() returns only the text, with every tag stripped
    print(stuff.text_content())

This seems to work, but it strips all the other tags, not just the <p>.

Output:

Hello world, this is a simple test, which contains words in italic and others.
I have a strong tag here. I guess this is a silly test.
Ops, line breaks.

As you can see, it removed all the <b>, <i> and <strong> tags. Is there any way to keep them?

score 3 · Accepted Answer

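You are currently retrieving only the text content, not the HTML content (which includes the tags).

You want to retrieve all the child nodes of the elements your XPath matches: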

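import lxml.html
from lxml import etree

x = lxml.html.fromstring("...html text...").xpath("//div[@class='ArticleDetail']/p")
for elem in x:
    # Serialize each descendant element with its tags (and its tail text)
    for child in elem.iterdescendants():
        print(etree.tostring(child, encoding="unicode"))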
