I'm writing an HTML parser using lxml's XPath support. It seems to work fine, but I have one major problem.

When parsing all the HTML <p> tags, some words inside them are wrapped in <b> and <i> tags. I need to keep those tags.

For example, when parsing this HTML:

<div class="ArticleDetail">
    <p>Hello world, this is a <b>simple</b> test, which contains words in <i>italic</i> and others.
    I have a <strong>strong</strong> tag here. I guess this is a silly test.
    <br/>
    Ops, line breaks.
    <br/></p>

If I run this Python code:

import lxml.html

x = lxml.html.fromstring("...html text...").xpath("//div[@class='ArticleDetail']/p")
for stuff in x:
    print(stuff.text_content())

This seems to work, but it strips every tag inside the <p>, not just the <p> itself.

Output:

Hello world, this is a simple test, which contains words in italic and others.
I have a strong tag here. I guess this is a silly test.
Ops, line breaks.

As you can see, it removed all the <b>, <i> and <strong> tags. Is there any way to keep them?

1 Answer

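You are currently retrieving only the text content, not the HTML content (which includes the tags).

You want to retrieve the descendant nodes of each element matched by the XPath expression: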

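import lxml.html
from lxml import etree

x = lxml.html.fromstring("...html text...").xpath("//div[@class='ArticleDetail']/p")
for elem in x:
    # Each descendant is serialized as markup, together with its tail text.
    for child in elem.iterdescendants():
        print(etree.tostring(child, encoding="unicode"))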
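
Note that iterating over descendants does not emit the text that precedes the first child tag (the element's own `.text`). As a minimal sketch of one way to get the full inner HTML of each `<p>`, the element's leading text can be joined with the serialized children. The helper name `inner_html` and the shortened sample markup are hypothetical, chosen only for illustration:

```python
import lxml.html
from lxml import etree

# Shortened sample markup, assumed for illustration only.
html = """<div class="ArticleDetail">
<p>Hello world, this is a <b>simple</b> test with <i>italic</i> words.<br/>Ops, line breaks.</p>
</div>"""


def inner_html(elem):
    """Return the markup inside elem, without elem's own start/end tags."""
    # elem.text is the text before the first child (None if there is none).
    parts = [elem.text or ""]
    for child in elem:
        # tostring() includes the child's tail text by default (with_tail=True).
        parts.append(etree.tostring(child, encoding="unicode", method="html"))
    return "".join(parts)


root = lxml.html.fromstring(html)
for p in root.xpath("//div[@class='ArticleDetail']/p"):
    print(inner_html(p))
```

If the surrounding `<p>` tag is wanted as well, calling `etree.tostring(p, encoding="unicode", method="html")` on the matched element directly should be enough.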
answered 2012-09-05T13:28:23.583