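It may be easier to use lxml's itersiblings(), working on siblings rather than descendants, and then on the descendants of those siblings if necessary.
You could try something like this: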
>>> for heading in root.iter("h3"):
...     print("----", heading)
...     for sibling in heading.itersiblings():
...         # stop as soon as the next heading is reached
...         if sibling.tag == 'h3':
...             break
...         print(sibling)
...
---- <Element h3 at 0x1880470>
<Element p at 0x18800b0>
<Element p at 0x1880110>
<Element a at 0x1880170>
---- <Element h3 at 0x1880050>
<Element p at 0x18801d0>
>>>
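To try the loop above without the original page, here is a minimal self-contained sketch. The sample markup and the helper name siblings_until_next_h3 are assumptions made up for illustration; only iter() and itersiblings() come from lxml.

```python
from lxml import etree

# Hypothetical sample document, made up for illustration.
SAMPLE = """
<div>
  <h3>First</h3>
  <p>one</p>
  <p>two</p>
  <a href="#">link</a>
  <h3>Second</h3>
  <p>three</p>
</div>
"""


def siblings_until_next_h3(heading):
    """Collect the siblings that follow `heading`, stopping at the next <h3>."""
    collected = []
    for sibling in heading.itersiblings():
        if sibling.tag == "h3":
            break
        collected.append(sibling)
    return collected


root = etree.fromstring(SAMPLE)

# Map each heading's text to the tags of the elements in its "section".
sections = {
    heading.text: [el.tag for el in siblings_until_next_h3(heading)]
    for heading in root.iter("h3")
}
print(sections)  # {'First': ['p', 'p', 'a'], 'Second': ['p']}
```

Returning a list instead of printing makes it easy to go on and process the descendants of each collected sibling.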
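If you want to use XPath, you can use EXSLT's set extension, which is available in lxml (through the "http://exslt.org/sets" namespace), following roughly the same idea as above:
- select all following siblings (following-sibling::*),
- but exclude (set:difference()) the next <h3> siblings (following-sibling::h3) and (the | XPath operator) all of their following siblings (following-sibling::h3/following-sibling::*).
It can be used like this: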
>>> import lxml.etree
>>> following_siblings_untilh3 = lxml.etree.XPath("""
...     set:difference(
...         following-sibling::*,
...         (following-sibling::h3|following-sibling::h3/following-sibling::*))""",
...     namespaces={"set": "http://exslt.org/sets"})
>>>
>>> for heading in root.iter("h3"):
...     print("----", heading)
...     for e in following_siblings_untilh3(heading):
...         print(e)
...
---- <Element h3 at 0x1880470>
<Element p at 0x18800b0>
<Element p at 0x1880110>
<Element a at 0x1880170>
---- <Element h3 at 0x1880050>
<Element p at 0x18801d0>
>>>
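The same XPath approach can be sketched as a standalone script. The sample markup and the names until_next_h3 and xpath_sections are hypothetical, chosen only for illustration; the expression itself is the one shown above.

```python
from lxml import etree

# Hypothetical sample document, made up for illustration.
SAMPLE = (
    "<div>"
    "<h3>First</h3><p>one</p><p>two</p><a>link</a>"
    "<h3>Second</h3><p>three</p>"
    "</div>"
)

# All following siblings, minus every later <h3> and everything after it.
until_next_h3 = etree.XPath(
    """
    set:difference(
        following-sibling::*,
        (following-sibling::h3 | following-sibling::h3/following-sibling::*))
    """,
    namespaces={"set": "http://exslt.org/sets"},
)

root = etree.fromstring(SAMPLE)

# Evaluate the compiled expression with each heading as the context node.
xpath_sections = {
    heading.text: [el.tag for el in until_next_h3(heading)]
    for heading in root.iter("h3")
}
print(xpath_sections)
```

Compiling the expression once with etree.XPath and calling it per heading avoids re-parsing the XPath string on every iteration.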
I'm sure it can be simplified. (I haven't found a following-sibling-or-self::h3 ...)