python - How can I reliably extract the property and content attributes from meta tags?

Question

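I have, for example, the following lines of HTML. I need to extract the og:image property and its content value as a list. The problem is that if I use a simple string.split(), the two lines below give differently shaped results, because the content value on the second line contains many spaces.

How can I reliably process lines like these and get a list such as ['og:image', 'http....whatever.jpg'], and the same for the second line?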

 <meta property="og:image" content="http://google.com/example.jpg"/>
 <meta property="og:title" content="Fant over 300 falske personer i skattelistene"/>

Edit: This is how I parse it at the moment:

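from lxml import etree

tree = etree.HTML( xml )  # xml holds the page source as a string
m = tree.xpath("//meta[@property]")  # every <meta> element that has a property attribute
for i in m:
    og = etree.tostring( i, encoding="unicode" )  # serialize the element back to markup
    print(og) # <meta property="og:image" content="http://google.com/example.jpg"/>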

Maybe there is a way to get the content/property values into a list directly with XPath?

score 1 · Accepted Answer

There is no need to convert the elements back to strings. Just read the attributes through each element's attrib mapping:

for i in m:
    print (i.attrib['property'], i.attrib['content'])
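
As a fuller illustration, here is a minimal, self-contained sketch of the same idea. It assumes lxml is installed and reuses the two sample lines from the question; the variable names `html`, `pairs`, and `og_image` are made up for this example. It also shows that an XPath expression ending in `/@content` returns the attribute values directly, which may be what the edit in the question was after.

```python
from lxml import etree

# Sample markup taken from the question; a real page would be fetched or read from disk.
html = """
<html><head>
 <meta property="og:image" content="http://google.com/example.jpg"/>
 <meta property="og:title" content="Fant over 300 falske personer i skattelistene"/>
</head></html>
"""

tree = etree.HTML(html)

# Build [property, content] pairs from every <meta> element that has a property attribute.
# .get() returns None instead of raising KeyError when an attribute is absent.
pairs = [
    [el.get("property"), el.get("content")]
    for el in tree.xpath("//meta[@property]")
]
print(pairs)

# XPath can also select attribute values directly, returning a list of strings.
og_image = tree.xpath("//meta[@property='og:image']/@content")
print(og_image)
```

Because the values are read from the parsed attributes rather than split out of the markup, spaces inside content do not matter.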
