python - python: lxml XPath tag name with a colon

Question

I have to parse some feeds, but one of the elements (tags) has a colon in its name: <dc:creator>leemore23</dc:creator>

How can I parse it with lxml? This is what I did:

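import lxml.etree
import requests

data = {}
r = requests.get('http://www.site.com/feed/')
# Workaround: rename the prefixed tag before parsing (r.content is bytes)
foo = r.content.replace(b"dc:creator", b"dc")
tree = lxml.etree.fromstring(foo)
for article_node in tree.xpath('//item'):
    data['dc'] = article_node.xpath('.//dc')[0].text.strip()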

But I think there must be a better way, such as

data['dc'] = article_node.xpath('.//dc:creator')[0].text.strip()

or

data['dc'] = article_node.xpath('.//dc|creator')[0].text.strip()

so that no replacement is needed.

Do you have any suggestions?

score 3 · Accepted Answer

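The dc: prefix denotes an XML namespace. Handle it with the ElementTree API's namespace support rather than stripping it from the input. As it happens, dc usually refers to Dublin Core metadata.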

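You need to determine the full namespace URL, then use that URL in your query. Note that lxml's xpath() method does not accept the {namespace}tag notation, so use findall() for this form:

DCNS = 'http://purl.org/dc/elements/1.1/'
creator = article_node.findall('.//{{{0}}}creator'.format(DCNS))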

Here I used the recommended namespace URL for the Dublin Core prefix, http://purl.org/dc/elements/1.1/.
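To see the fully qualified {namespace}tag form in action, here is a minimal sketch run against a made-up feed snippet. SAMPLE_FEED is hypothetical and only stands in for the real feed; the lookup uses findall(), which understands this notation.

```python
from lxml import etree

# Hypothetical feed snippet, used only for illustration.
SAMPLE_FEED = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>First post</title>
      <dc:creator>leemore23</dc:creator>
    </item>
  </channel>
</rss>"""

DCNS = 'http://purl.org/dc/elements/1.1/'

tree = etree.fromstring(SAMPLE_FEED)

# '{namespace-URL}localname' is the fully qualified name of dc:creator.
qualified_name = '{{{0}}}creator'.format(DCNS)

# findall() accepts the qualified name directly, no prefix mapping needed.
creators = [el.text.strip() for el in tree.findall('.//' + qualified_name)]
print(creators)
```

The double braces in the format string are just escaped literal braces, so qualified_name ends up as {http://purl.org/dc/elements/1.1/}creator.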

You can usually determine the URL from the .nsmap attribute; your root element probably has the following .nsmap value:

{'dc': 'http://purl.org/dc/elements/1.1/'}
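If you are unsure what a given feed declares, you can inspect it directly. The sketch below assumes a tiny hypothetical document (SAMPLE) and prints the prefix mapping of the root along with the pieces of one namespaced tag.

```python
from lxml import etree

# Hypothetical document, used only for illustration.
SAMPLE = (
    b'<rss xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b'<channel><item><dc:creator>leemore23</dc:creator></item></channel>'
    b'</rss>'
)

root = etree.fromstring(SAMPLE)

# Prefix -> namespace URL mapping visible at the root element.
print(root.nsmap)

# rss -> channel -> item -> dc:creator
creator = root[0][0][0]

# lxml stores the tag in its fully qualified form.
print(creator.tag)

# QName splits that form into namespace URL and local name.
qname = etree.QName(creator)
print(creator.prefix, qname.namespace, qname.localname)
```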

So, using findall() for the {namespace}tag form, you can change your code to:

creator = article_node.findall('.//{{{0}}}creator'.format(article_node.nsmap['dc']))

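This can be simplified further by passing the nsmap dictionary to the xpath() method as the namespaces keyword argument, at which point you can use the prefix in the XPath expression: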

creator = article_node.xpath('.//dc:creator', namespaces=article_node.nsmap)
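One caveat with passing nsmap straight through: if the document declares a default namespace, nsmap contains a None key, and XPath has no notion of an empty prefix, so lxml may reject the mapping depending on the version. A minimal sketch of a workaround, assuming a hypothetical Atom-style feed (SAMPLE) and an arbitrarily chosen prefix name atom:

```python
from lxml import etree

# Hypothetical feed with a default namespace plus the dc prefix.
SAMPLE = (
    b'<feed xmlns="http://www.w3.org/2005/Atom" '
    b'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b'<entry><dc:creator>leemore23</dc:creator></entry>'
    b'</feed>'
)

root = etree.fromstring(SAMPLE)

# Drop the default-namespace entry (key None) before handing it to xpath().
ns = {prefix: uri for prefix, uri in root.nsmap.items() if prefix is not None}

# Give the default namespace a prefix of our own choosing so it is queryable.
ns['atom'] = root.nsmap[None]

creators = root.xpath('//atom:entry/dc:creator/text()', namespaces=ns)
print(creators)
```

The prefix used in the expression only has to match a key in the dictionary you pass; it does not have to match the prefix used in the document.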

score 2 · Accepted Answer

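dc: denotes a namespace. When using lxml's xpath method, pass the namespaces parameter to search for elements in a namespace.

So in your case, using the Dublin Core namespace URL provided by @MartijnPieters,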

import lxml.etree
import requests

data = {}
r = requests.get('http://www.site.com/feed/')
tree = lxml.etree.fromstring(r.content)
# Map the dc prefix to the Dublin Core namespace URL
ns = {'dc': 'http://purl.org/dc/elements/1.1/'}
for article_node in tree.xpath('//item'):
    data['dc'] = article_node.xpath('.//dc:creator', namespaces=ns)[0].text.strip()
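Indexing with [0] raises IndexError for any item that has no dc:creator. A minimal sketch of a more defensive variant follows; the helper name creators_by_item and the inline SAMPLE_FEED are hypothetical, and the feed is embedded so the example runs without a network request.

```python
from lxml import etree

NS = {'dc': 'http://purl.org/dc/elements/1.1/'}

# Hypothetical feed: the second item has no dc:creator.
SAMPLE_FEED = b"""<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item><title>One</title><dc:creator> leemore23 </dc:creator></item>
    <item><title>Two</title></item>
  </channel>
</rss>"""


def creators_by_item(xml_bytes):
    """Return one creator string (or None when absent) per <item>."""
    tree = etree.fromstring(xml_bytes)
    result = []
    for item in tree.xpath('//item'):
        # text() yields the element's text nodes, or an empty list.
        matches = item.xpath('./dc:creator/text()', namespaces=NS)
        result.append(matches[0].strip() if matches else None)
    return result


creators = creators_by_item(SAMPLE_FEED)
print(creators)
```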
