python - How to find XML elements via XPath in Python in a namespace-agnostic way?

Question

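Since this is the second time I have run into this annoying problem, I thought asking would help.

Sometimes I have to get elements out of XML documents, but the ways to do so are awkward.

I'd like to know of a Python library that does what I want, an elegant way to formulate my XPaths, a way to register the namespaces in prefixes automatically, or a hidden preference in the built-in XML implementations or in lxml to strip namespaces completely. Clarification follows, unless you already know what I want :)

Example document: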

<root xmlns="http://really-long-namespace.uri"
  xmlns:other="http://with-ambivalent.end/#">
    <other:elem/>
</root>

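What I can do

The ElementTree API is the only built-in one (that I know of) that provides XPath queries. But it requires me to use "UNames". This looks like so: /{http://really-long-namespace.uri}root/{http://with-ambivalent.end/#}elem

As you can see, these are quite verbose. I can shorten them by doing the following: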

default_ns = "http://really-long-namespace.uri"
other_ns   = "http://with-ambivalent.end/#"
doc.find("/{{{0}}}root/{{{1}}}elem".format(default_ns, other_ns))

But this is both {{{ugly}}} and fragile, since http…end/# ≃ http…end# ≃ http…end/ ≃ http…end, and who am I to know which variant will be used?
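As a side note, ElementTree's find methods also accept a prefix-to-URI mapping, which avoids the string formatting. The following is only a minimal sketch: the prefixes "d" and "o" are arbitrary names chosen for illustration (they need not match the document), and the path is relative to the root element. It does not remove the fragility, since the URIs still have to be spelled exactly as in the document.

```python
import xml.etree.ElementTree as ET

DOC = """<root xmlns="http://really-long-namespace.uri"
  xmlns:other="http://with-ambivalent.end/#">
    <other:elem/>
</root>"""

# Caller-chosen prefixes (hypothetical names); only the URIs must match.
NS = {
    "d": "http://really-long-namespace.uri",
    "o": "http://with-ambivalent.end/#",
}

root = ET.fromstring(DOC)

# "o:elem" is expanded to "{http://with-ambivalent.end/#}elem" via NS.
elem = root.find("o:elem", namespaces=NS)
print(elem.tag)
```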

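Also, lxml supports namespace prefixes, but it neither uses the ones in the document nor provides an automated way to deal with default namespaces. I would still need to get one element of each namespace to retrieve it from the document. Namespace attributes are not preserved, so there is no way of automatically retrieving them from these either.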

There is also a namespace-agnostic way of writing XPath queries, but it is both verbose/ugly and unavailable in the built-in implementation: /*[local-name() = 'root']/*[local-name() = 'elem']
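For readers on newer interpreters: the built-in ElementTree accepts a {*} wildcard in place of a namespace since Python 3.8, which gives a shorter namespace-agnostic query than local-name(). This is a minimal sketch that assumes Python 3.8 or later; the path is relative to the root element because ElementTree does not accept absolute paths on an element.

```python
import xml.etree.ElementTree as ET

DOC = """<root xmlns="http://really-long-namespace.uri"
  xmlns:other="http://with-ambivalent.end/#">
    <other:elem/>
</root>"""

root = ET.fromstring(DOC)

# "{*}elem" matches children named "elem" in any namespace (or in none).
matches = root.findall("{*}elem")

for node in matches:
    print(node.tag)
```

This only covers ElementTree's limited path subset, so more complex XPath expressions would still need lxml or a similar library.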

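What I want to do

I want to find a library, option, or generic XPath-morphing function to achieve the above examples by typing no more than the following…

Unnamespaced: /root/elem
Namespace prefixes from the document: /root/other:elem

…plus maybe some statements saying that I indeed want to use the document's prefixes or strip the namespaces.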

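Further clarification: although my current use case is as simple as that, I will have to use more complex ones in the future.

Thanks for reading!

Solved

User samplebias directed my attention to py-dom-xpath; exactly what I was looking for. My actual code now looks like this: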

import xml.dom.minidom
import xpath  # module provided by py-dom-xpath

#parse the document into a DOM tree
rdf_tree = xml.dom.minidom.parse("install.rdf")
#read the default namespace and prefix from the root node
context = xpath.XPathContext(rdf_tree)

name    = context.findvalue("//em:id", rdf_tree)
version = context.findvalue("//em:version", rdf_tree)

#<Description/> inherits the default RDF namespace
resource_nodes = context.find("//Description/following-sibling::*", rdf_tree)

Consistent with the document, simple, namespace-aware; perfect.

score 14 · Accepted Answer

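The *[local-name() = "elem"] syntax should work, but to make it easier you can create a function that simplifies the construction of partial or full "wildcard namespace" XPath expressions.

I'm using python-lxml 2.2.4 on Ubuntu 10.04, and the script below works for me. You will need to customize the behavior depending on how you want to specify the default namespace for each element, and handle any other XPath syntax you want to fold into the expression: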

import lxml.etree

def xpath_ns(tree, expr):
    "Parse a simple expression and prepend namespace wildcards where unspecified."
    qual = lambda n: n if not n or ':' in n else '*[local-name() = "%s"]' % n
    expr = '/'.join(qual(n) for n in expr.split('/'))
    nsmap = dict((k, v) for k, v in tree.nsmap.items() if k)
    return tree.xpath(expr, namespaces=nsmap)

doc = '''<root xmlns="http://really-long-namespace.uri"
    xmlns:other="http://with-ambivalent.end/#">
    <other:elem/>
</root>'''

tree = lxml.etree.fromstring(doc)
print(xpath_ns(tree, '/root'))
print(xpath_ns(tree, '/root/elem'))
print(xpath_ns(tree, '/root/other:elem'))

Output:

[<Element {http://really-long-namespace.uri}root at 23099f0>]
[<Element {http://with-ambivalent.end/#}elem at 2309a48>]
[<Element {http://with-ambivalent.end/#}elem at 2309a48>]

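Update: If you find that you really do need to parse XPath, you can look at projects like py-dom-xpath, which is a pure-Python implementation of (most of) XPath 1.0. At the very least, that will give you some idea of the complexity of parsing XPath.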

score 2 · Accepted Answer

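First, about "what you want to do":

Unnamespaced: /root/elem -> no problem here, I presume
Namespace prefixes from the document: /root/other:elem -> well, that's a bit of a problem; you cannot just use "namespace prefixes from the document". Even within one document:
- namespaced elements do not necessarily even have a prefix
- the same prefix is not necessarily always mapped to the same namespace URI
- the same namespace URI does not necessarily always have the same prefix

FYI: if you want to get at the prefix mappings that are in scope for a certain element, try elem.nsmap in lxml. Also, the iterparse and iterwalk methods in lxml.etree can be used to be "notified" of namespace declarations.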
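To make the "notified of namespace declarations" idea concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree.iterparse rather than lxml's, relying on the "start-ns" event that the standard library also provides. The helper name collect_namespaces, the second <other:elem/> and its http://example.invalid/rebinding URI are hypothetical additions, included only to show a prefix being rebound within one document.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# The example document plus one hypothetical element that rebinds "other".
DOC = b"""<root xmlns="http://really-long-namespace.uri"
  xmlns:other="http://with-ambivalent.end/#">
    <other:elem/>
    <other:elem xmlns:other="http://example.invalid/rebinding"/>
</root>"""


def collect_namespaces(source):
    """Map each prefix to the URIs it is bound to, in order of declaration."""
    bindings = defaultdict(list)
    # Each "start-ns" event carries a (prefix, uri) tuple; the default
    # namespace is reported with an empty-string prefix.
    for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        if uri not in bindings[prefix]:
            bindings[prefix].append(uri)
    return dict(bindings)


bindings = collect_namespaces(io.BytesIO(DOC))
for prefix, uris in bindings.items():
    print(repr(prefix), uris)
```

A prefix that ends up with more than one URI in the result illustrates why "the prefix from the document" is not a unique key for the whole document.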
