Here is how to do it with lxml:
>>> from lxml.etree import fromstring
>>> tree = fromstring('''<section> Fubar, I'm so fubar, fubar and even more <fref bar="baz">fubare</fref>. And yet more fubar. </section>''')
>>> elem = tree.xpath('/section/fref')[0]
>>> elem.text
'fubare'
>>> elem.tail
'. And yet more fubar. '
>>> elem.getparent().text
" Fubar, I'm so fubar, fubar and even more "
From the lxml.etree tutorial:
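If you want to read only the text, i.e. without any intermediate tags, you have to recursively concatenate all text and tail attributes in the correct order. Again, the tostring() function comes to the rescue, this time using the method keyword: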
>>> from lxml.etree import tostring
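>>> # `tree` is the element parsed above; encoding="unicode" returns str rather than bytes on Python 3
>>> tostring(tree, method="text", encoding="unicode")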
" Fubar, I'm so fubar, fubar and even more fubare. And yet more fubar. "
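To make the "recursively concatenate all text and tail attributes" part concrete, here is a minimal sketch of doing it by hand. The helper name `collect_text` is my own and not part of lxml, and it does not special-case comments or processing instructions, so treat it as an illustration of the text/tail model rather than a replacement for tostring().

```python
try:
    from lxml.etree import fromstring
except ImportError:
    # Fallback: the standard library exposes the same .text/.tail model
    from xml.etree.ElementTree import fromstring


def collect_text(elem):
    """Hypothetical helper: join elem.text, then each child's text and tail, in document order."""
    parts = [elem.text or ""]          # text before the first child
    for child in elem:
        parts.append(collect_text(child))  # everything inside the child
        parts.append(child.tail or "")     # text between this child and the next
    return "".join(parts)


tree = fromstring('''<section> Fubar, I'm so fubar, fubar and even more <fref bar="baz">fubare</fref>. And yet more fubar. </section>''')
flat = collect_text(tree)
print(flat)
```

For the sample document this should produce the same string as the tostring() call above.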
There is also an XPath way to do this, which is described on the linked page.
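As a rough sketch of the XPath route, assuming the same sample document as above: `string()` returns the string-value of the element in one piece, while `//text()` returns the individual text nodes, which you can join yourself. The variable names here are only for illustration.

```python
from lxml.etree import fromstring

tree = fromstring('''<section> Fubar, I'm so fubar, fubar and even more <fref bar="baz">fubare</fref>. And yet more fubar. </section>''')

# string() yields all descendant text of the context node, in document order
via_string = tree.xpath("string()")

# //text() yields each text node separately (content of both .text and .tail)
pieces = tree.xpath("//text()")
via_join = "".join(pieces)

print(via_string)
print(pieces)
```

The `//text()` form is handy when you want to filter or strip the individual chunks before joining them.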