python-2.7 - How to get the tags that follow a specific tag in HTML using XPath

Question

I have this HTML code:

<a name="apple"></a>
<h3> header1 </h3>
<p> some text </p>
<p> some text1 </p>
<a name="orange"></a>
<h3> header2 </h3>
<p> some text 2 </p>

I want to retrieve the text that follows each header tag, using code like this:

for header in tree.iter('h3'):
    paragraph = header.xpath('(.//following::p)[1]')
    if (header.text=="apple"):
        print "%s: %s" % (header.text, paragraph[0].text)

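This does not work when a header is followed by more than one <p> tag. How can I find out how many <p> tags follow a header, and retrieve all of them?

I am using Python 2.7 and XPath.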

score 2 · Accepted Answer

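It is probably easier to use lxml's itersiblings(), which works on siblings rather than descendants, and then to process the descendants of those siblings if necessary.

You can try something like this: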

>>> for heading in root.iter("h3"):
...     print "----", heading
...     for sibling in heading.itersiblings():
...         if sibling.tag == 'h3':
...             break
...         print sibling
... 
---- <Element h3 at 0x1880470>
<Element p at 0x18800b0>
<Element p at 0x1880110>
<Element a at 0x1880170>
---- <Element h3 at 0x1880050>
<Element p at 0x18801d0>
>>>
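
To get from the elements to the text the question asks for, the same loop can collect the <p> siblings of each heading. The following is only a minimal sketch: it assumes the question's markup is wrapped in a <div> root so that it parses as a single tree, and paragraphs_after is an illustrative helper name, not part of lxml. The single-argument print(...) form is used so that the snippet behaves the same under Python 2.7 and Python 3.

```python
from lxml import etree

# The question's markup, wrapped in a <div> root (an assumption)
# so that it parses as a single tree.
HTML = """<div>
<a name="apple"></a>
<h3> header1 </h3>
<p> some text </p>
<p> some text1 </p>
<a name="orange"></a>
<h3> header2 </h3>
<p> some text 2 </p>
</div>"""

root = etree.fromstring(HTML)


def paragraphs_after(heading):
    """Return the <p> siblings between `heading` and the next <h3>."""
    paragraphs = []
    for sibling in heading.itersiblings():
        if sibling.tag == "h3":
            break  # the next section starts here
        if sibling.tag == "p":
            paragraphs.append(sibling)
    return paragraphs


for heading in root.iter("h3"):
    texts = [p.text.strip() for p in paragraphs_after(heading)]
    print("%s: %d paragraph(s): %s" % (heading.text.strip(), len(texts), texts))
```

len(texts) answers "how many <p> tags follow the header", and the list itself holds all of their text. Elements that are not <p>, such as the <a name="orange"> anchor, are skipped rather than treated as a section boundary.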

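If you want to use XPath, you can use EXSLT's set extensions, which are available in lxml (via the "http://exslt.org/sets" namespace). The idea is roughly the same as above:

select all following siblings (following-sibling::*),
but exclude (set:difference()) the following <h3> siblings (following-sibling::h3) and (the | XPath operator) all of their following siblings (following-sibling::h3/following-sibling::*).

It can be used like this: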

>>> following_siblings_untilh3 = lxml.etree.XPath("""
...         set:difference(
...             following-sibling::*,
...             (following-sibling::h3|following-sibling::h3/following-sibling::*))""",
...         namespaces={"set": "http://exslt.org/sets"})
>>> 
>>> for heading in root.iter("h3"):
...     print "----", heading
...     for e in following_siblings_untilh3(heading): print e
... 
---- <Element h3 at 0x1880470>
<Element p at 0x18800b0>
<Element p at 0x1880110>
<Element a at 0x1880170>
---- <Element h3 at 0x1880050>
<Element p at 0x18801d0>
>>>

I am sure it can be simplified. (I have not found a following-sibling-or-self::h3 ...)
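
One possible simplification, offered as a sketch rather than as the answer's own method, avoids the EXSLT extension and uses plain XPath 1.0 counting instead: an element belongs to a heading's section if it is not an <h3> and has exactly as many preceding <h3> siblings as the heading has, counting the heading itself. The <div> wrapper, the variable name $n, and the helper name siblings_until_next_h3 are assumptions made for illustration; the approach also assumes that all headings are siblings under the same parent.

```python
from lxml import etree

# The question's markup, wrapped in a <div> root (an assumption)
# so that it parses as a single tree.
HTML = """<div>
<a name="apple"></a>
<h3> header1 </h3>
<p> some text </p>
<p> some text1 </p>
<a name="orange"></a>
<h3> header2 </h3>
<p> some text 2 </p>
</div>"""

root = etree.fromstring(HTML)

# $n is an XPath variable supplied on each call: the 1-based position
# of the current heading among its <h3> siblings.
section_siblings = etree.XPath(
    "following-sibling::*[not(self::h3)]"
    "[count(preceding-sibling::h3) = $n]")


def siblings_until_next_h3(heading):
    """Return the siblings after `heading`, stopping before the next <h3>."""
    # count() returns a float; add 1 to include the heading itself.
    n = heading.xpath("count(preceding-sibling::h3)") + 1
    return section_siblings(heading, n=n)


for heading in root.iter("h3"):
    tags = [e.tag for e in siblings_until_next_h3(heading)]
    print("%s -> %s" % (heading.text.strip(), tags))
```

As with the set:difference() version, the result for the first heading includes the trailing <a> anchor, so filter on the tag (for example with following-sibling::p in place of following-sibling::*[not(self::h3)]) if only paragraphs are wanted.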
