I'm trying to teach myself how to parse XML. I've read the lxml tutorials, but they are hard to follow. So far I can do this:

>>> from lxml import etree
>>> xml=etree.parse('ham.xml')
>>> xml
<lxml.etree._ElementTree object at 0x118de60>

But how can I get data out of this object? It cannot be indexed like xml[0], and it cannot be iterated over.

More specifically, I'm using this xml file, and I'm trying to extract everything between the <l> tags that are enclosed in <sp> tags carrying the attribute Barnardo.

2 Answers

etree.parse() returns an ElementTree object; its getroot() method gives you the root Element object.

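You can also look at the lxml API documentation, which has a page for lxml.etree._Element. That page lists every attribute and method of that class you might want to know about.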

However, I would start by reading the lxml.etree tutorial.

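The ElementTree object itself cannot be indexed, but an Element can. If indexing an Element fails, it is an empty tag with no child nodes to retrieve.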
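As a minimal sketch of the difference between the tree and its root element, the snippet below parses a small inline document instead of ham.xml. The SAMPLE fragment is a hypothetical stand-in written for this illustration, not the real file; only the namespace URL is taken from this thread.

```python
import io
from lxml import etree

# Hypothetical stand-in for ham.xml: a tiny TEI-like fragment.
SAMPLE = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <sp who="Barnardo"><l>Who's there?</l></sp>
  <sp who="Francisco"><l>Nay, answer me: stand, and unfold yourself.</l></sp>
</TEI>"""

tree = etree.parse(io.BytesIO(SAMPLE))  # an _ElementTree, like xml in the question
root = tree.getroot()                   # the root _Element

first_child = root[0]                   # an Element supports indexing...
speakers = [sp.get('who') for sp in root]  # ...and iteration over its direct children

# Tags of namespaced elements are reported as {namespace-url}localname.
print(first_child.tag)
print(speakers)
```

Note that the tag names come back with the namespace URL in braces, which is why the plain name sp or l does not match anything in this document.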

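To find all lines spoken by Barnardo, you need an XPath expression with a namespace map. It doesn't matter which prefix you use; as long as it is a non-empty string, lxml will map it to the correct namespace URL: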

from lxml import etree

tree = etree.parse('ham.xml')
nsmap = {'s': 'http://www.tei-c.org/ns/1.0'}

for line in tree.xpath('.//s:sp[@who="Barnardo"]/s:l/text()', namespaces=nsmap):
    print(line.strip())

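This extracts all the text in <l> elements contained in <sp who="Barnardo"> tags. Note the s: prefix on the tag names; the nsmap dictionary tells lxml which namespace to use. I printed the lines without the extra surrounding whitespace.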

For your sample document, this gives:

>>> for line in tree.xpath('.//s:sp[@who="Barnardo"]/s:l/text()', namespaces=nsmap):
...     print(line.strip())
...
Who's there?
Long live the king!
He.
'Tis now struck twelve; get thee to bed, Francisco.
Have you had quiet guard?
Well, good night.
If you do meet Horatio and Marcellus,
The rivals of my watch, bid them make haste.
Say,
What, is Horatio there?
Welcome, Horatio: welcome, good Marcellus.
I have seen nothing.
Sit down awhile;
And let us once again assail your ears,
That are so fortified against our story
What we have two nights seen.
Last night of all,
When yond same star that's westward from the pole
Had made his course to illume that part of heaven
Where now it burns, Marcellus and myself,
The bell then beating one,

In the same figure, like the king that's dead.
Looks 'a not like the king? mark it, Horatio.
It would be spoke to.
See, it stalks away!
How now, Horatio! you tremble and look pale:
Is not this something more than fantasy?
What think you on't?
I think it be no other but e'en so:
Well may it sort that this portentous figure
Comes armed through our watch; so like the king
That was and is the question of these wars.
'Tis here!
It was about to speak, when the cock crew.
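To try the query without ham.xml, here is a small self-contained sketch. The SAMPLE fragment and the tei prefix are assumptions made for illustration. It also shows that the prefix name is arbitrary, and that findall() accepts the same prefix map if you want the <l> elements themselves rather than their text.

```python
from lxml import etree

TEI_NS = 'http://www.tei-c.org/ns/1.0'

# Hypothetical stand-in for ham.xml, using the same namespace.
SAMPLE = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <sp who="Barnardo"><l> Who's there? </l></sp>
  <sp who="Francisco"><l>Nay, answer me: stand, and unfold yourself.</l></sp>
  <sp who="Barnardo"><l>Long live the king!</l></sp>
</TEI>"""

root = etree.fromstring(SAMPLE)

# The XPath from the answer, with the prefix s bound to the TEI namespace.
with_s = [t.strip() for t in root.xpath(
    './/s:sp[@who="Barnardo"]/s:l/text()', namespaces={'s': TEI_NS})]

# A different prefix bound to the same URL selects the same nodes.
with_tei = [t.strip() for t in root.xpath(
    './/tei:sp[@who="Barnardo"]/tei:l/text()', namespaces={'tei': TEI_NS})]

# findall() returns the <l> elements; read their text via the .text attribute.
elements = root.findall('.//s:sp[@who="Barnardo"]/s:l', namespaces={'s': TEI_NS})
from_elements = [el.text.strip() for el in elements]

print(with_s)
```

One difference to keep in mind: text() returns every text node under a matching element, while .text holds only the text before the element's first child, so the two can diverge when an <l> contains nested markup.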
answered 2013-05-19T22:18:47.253

One way to parse XML is to use XPath. In your case, you can call xpath(), a member function of the ElementTree object xml.

For example, to print the XML of all <l> elements (the lines of the play):

subtrees = xml.xpath('//l', namespaces={'prefix': 'http://www.tei-c.org/ns/1.0'})
for l in subtrees:
    print(etree.tostring(l))

The lxml documentation describes the XPath features in detail.

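As pointed out below, this does not work unless a namespace is specified. Unfortunately, lxml does not support an empty namespace prefix in XPath, but you can change the root node to declare the namespace under the name prefix, which is also the name used above.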

<TEI xmlns:prefix="http://www.tei-c.org/ns/1.0" xml:id="sha-ham">
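If you would rather not edit the document, one alternative is to leave the default namespace in place and put the prefix into the XPath expression itself. The sketch below assumes a small hypothetical SAMPLE fragment in place of ham.xml. It compares the unprefixed query, the prefixed query, and a local-name() test that ignores namespaces altogether.

```python
from lxml import etree

# Hypothetical stand-in for ham.xml with the original default namespace.
SAMPLE = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="sha-ham">
  <sp who="Barnardo"><l>Who's there?</l></sp>
  <sp who="Francisco"><l>Nay, answer me: stand, and unfold yourself.</l></sp>
</TEI>"""

root = etree.fromstring(SAMPLE)
ns = {'prefix': 'http://www.tei-c.org/ns/1.0'}

# Unprefixed names only match elements in no namespace, so this finds nothing.
unprefixed = root.xpath('//l', namespaces=ns)

# Using the prefix in the expression matches the namespaced <l> elements.
prefixed = root.xpath('//prefix:l', namespaces=ns)

# local-name() compares only the part of the tag after the namespace.
by_local_name = root.xpath('//*[local-name() = "l"]')

for l in prefixed:
    print(etree.tostring(l, encoding='unicode'))
```

The prefixed form is usually the safer choice, since local-name() would also match an l element from any other namespace.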
answered 2013-05-19T22:44:57.647