I am writing an HTML parser using lxml's XPath support. It seems to work fine, but I have one major problem.
When parsing all the HTML <p> tags, some words inside them are wrapped in tags such as <b> and <i>. I need to keep those tags.
For example, when parsing this HTML:
<div class="ArticleDetail">
<p>Hello world, this is a <b>simple</b> test, which contains words in <i>italic</i> and others.
I have a <strong>strong</strong> tag here. I guess this is a silly test.
<br/>
Ops, line breaks.
<br/></p>
If I run this Python code:
import lxml.html
x = lxml.html.fromstring("...html text...").xpath("//div[@class='ArticleDetail']/p")
for stuff in x:
    # text_content() returns only the text, dropping every nested tag
    print(stuff.text_content())
This seems to work, but it strips every tag inside the paragraph, not just the <p> itself.
Output:
Hello world, this is a simple test, which contains words in italic and others.
I have a strong tag here. I guess this is a silly test.
Ops, line breaks.
As you can see, it has removed all the <b>, <i> and <strong> tags. Is there any way to keep them?
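One direction that might work, offered as a minimal sketch rather than a confirmed answer: instead of text_content(), serialize each child of the <p> with lxml.html.tostring and prepend the paragraph's leading text. The helper name inner_html is hypothetical (it is not part of lxml), and the closing </div> in the sample string is an assumption added so the snippet is complete.

```python
import lxml.html

# Sample markup from the question; the closing </div> is assumed here.
HTML = """<div class="ArticleDetail">
<p>Hello world, this is a <b>simple</b> test, which contains words in <i>italic</i> and others.
I have a <strong>strong</strong> tag here. I guess this is a silly test.
<br/>
Ops, line breaks.
<br/></p>
</div>"""


def inner_html(element):
    """Hypothetical helper: return the markup inside `element`, without its own tag."""
    # Text that appears before the first child tag.
    parts = [element.text or ""]
    # Each child is serialized with its own tag plus the text that trails it (its tail).
    for child in element:
        parts.append(lxml.html.tostring(child, encoding="unicode"))
    return "".join(parts)


paragraphs = lxml.html.fromstring(HTML).xpath("//div[@class='ArticleDetail']/p")
for p in paragraphs:
    print(inner_html(p))
```

With this approach the <b>, <i> and <strong> tags should survive in the printed string. Note that the HTML serializer may write the line breaks as <br> rather than <br/>.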