python - Getting a bulleted list in lxml

Question

I have HTML like this:

...
<ul class="myclass">
    <li>blah</li>
    <li>blah2</li>
</ul>
...

I want to get the text "blah" and "blah2" from the ul whose class name is "myclass".

I tried using innerhtml(), but for some reason it doesn't work with lxml.

I'm using Python 3.

score 1 · Accepted Answer

I would try:

doc.xpath('.//ul[@class = "myclass"]/li/text()')
# out: ["blah","blah2"]
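
To check the expression above end to end, here is a minimal sketch that runs it against the markup from the question. The names `snippet` and `items` are made up for illustration, the wrapping <div> is an addition so the fragment has a single root, and parsing with lxml.html.fromstring is an assumption, since the question does not show how doc was built.

```python
import lxml.html as lh

# Markup from the question; the outer <div> is an added wrapper so that
# the <ul> is a descendant of the parsed root and './/ul' can find it.
snippet = """
<div>
<ul class="myclass">
    <li>blah</li>
    <li>blah2</li>
</ul>
</div>
"""

doc = lh.fromstring(snippet)

# text() selects the text nodes that are direct children of each <li>.
items = doc.xpath('.//ul[@class = "myclass"]/li/text()')
print(items)
```

Note that `.//ul` only searches below the context node, so if doc were the <ul> itself the expression would return an empty list.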

Edit:

What if there was an <a> in the <li>? For example, how would I get "link" and "text" from <li><a href="link">text</a></li>?

link = doc.xpath('.//ul[@class = "myclass"]/li/a/@href')
txt = doc.xpath('.//ul[@class = "myclass"]/li/a/text()')
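
The two queries above return parallel lists. If each href needs to stay paired with its own link text, one possible approach is to select the <a> elements once and read both values from each element. This is only a sketch: `snippet` and `pairs` are hypothetical names, and the wrapping <div> is added so the fragment has a single root.

```python
import lxml.html as lh

# The <li><a> case from the follow-up question, wrapped in an added <div>.
snippet = """
<div>
<ul class="myclass">
    <li><a href="link">text</a></li>
</ul>
</div>
"""

doc = lh.fromstring(snippet)

# Select each <a> once, then read its href attribute and its text,
# so the two values cannot drift out of alignment.
pairs = [
    (a.get("href"), a.text_content())
    for a in doc.xpath('.//ul[@class = "myclass"]/li/a')
]
print(pairs)
```

Reading both values from the same element avoids a mismatch when some <a> has no href or no text, which would shorten one of the two separate lists.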

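If you want, you can combine the two. Taking @larsmans' example, you can use '//' to get the whole text, because I believe lxml does not support string() as a step inside a path expression (lxml implements XPath 1.0, where string() can only wrap a whole expression).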

doc.xpath('.//ul[@class="myclass"]/li[a]//text() | .//ul[@class="myclass"]/li/a/@href')
# out: ['I contain a ', 'http://example.com', 'link', '.']
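
As a side note on string(): in XPath 1.0 it can wrap an entire expression, in which case it returns the string value of the first matching node. The sketch below shows that form with lxml; `snippet` and `whole` are illustrative names, and the wrapping <div> is an addition so the fragment has a single root.

```python
import lxml.html as lh

snippet = """
<div>
<ul class="myclass">
    <li>I contain a <a href="http://example.com">link</a>.</li>
    <li>blah</li>
</ul>
</div>
"""

doc = lh.fromstring(snippet)

# string(...) wraps the whole expression; it yields the concatenated text
# of the first <li> that has an <a> child, as a single string.
whole = doc.xpath('string(.//ul[@class="myclass"]/li[a])')
print(whole)
```

Because string() only considers the first node of the node-set, this form suits a single item; for every <li>, iterate over the elements instead.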

Alternatively, you can use the text_content() method:

import lxml.html as lh

html = """
<html>
<ul class="myclass">
    <li>I contain a <a href="http://example.com">link</a>.</li>
    <li>blah</li>
    <li>blah2</li>
</ul>
</html>
"""

doc = lh.fromstring(html)
# text_content() returns the text of an element and all of its descendants.
for elem in doc.xpath('.//ul[@class="myclass"]/li'):
    print(elem.text_content())

Prints:

#I contain a link.
#blah
#blah2
