16

我是 xml 解析和 Python 的新手,所以请耐心等待。我正在使用 lxml 来解析 wiki 转储,但我只想要每个页面、它的标题和文本。

现在我有这个:

from xml.etree import ElementTree as etree

def parser(file_name):
    document = etree.parse(file_name)
    titles = document.findall('.//title')
    print titles

目前,titles 没有返回任何内容。我看过以前的答案,例如:ElementTree findall() 返回空列表和 lxml 文档,但大多数事情似乎都是针对解析 HTML 量身定制的。

这是我的 XML 的一部分:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.7/"     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.7/ http://www.mediawiki.org/xml/export-0.7.xsd" version="0.7" xml:lang="en">
  <siteinfo>
  <sitename>Wikipedia</sitename>
<base>http://en.wikipedia.org/wiki/Main_Page</base>
<generator>MediaWiki 1.20wmf9</generator>
<case>first-letter</case>
<namespaces>
  <namespace key="-2" case="first-letter">Media</namespace>
  <namespace key="-1" case="first-letter">Special</namespace>
  <namespace key="0" case="first-letter" />
  <namespace key="1" case="first-letter">Talk</namespace>
  <namespace key="2" case="first-letter">User</namespace>
  <namespace key="3" case="first-letter">User talk</namespace>
  <namespace key="4" case="first-letter">Wikipedia</namespace>
  <namespace key="5" case="first-letter">Wikipedia talk</namespace>
  <namespace key="6" case="first-letter">File</namespace>
  <namespace key="7" case="first-letter">File talk</namespace>
  <namespace key="8" case="first-letter">MediaWiki</namespace>
  <namespace key="9" case="first-letter">MediaWiki talk</namespace>
  <namespace key="10" case="first-letter">Template</namespace>
  <namespace key="11" case="first-letter">Template talk</namespace>
  <namespace key="12" case="first-letter">Help</namespace>
  <namespace key="13" case="first-letter">Help talk</namespace>
  <namespace key="14" case="first-letter">Category</namespace>
  <namespace key="15" case="first-letter">Category talk</namespace>
  <namespace key="100" case="first-letter">Portal</namespace>
  <namespace key="101" case="first-letter">Portal talk</namespace>
  <namespace key="108" case="first-letter">Book</namespace>
  <namespace key="109" case="first-letter">Book talk</namespace>
</namespaces>
  </siteinfo>
  <page>
    <title>Aratrum</title>
    <ns>0</ns>
    <id>65741</id>
    <revision>
  <id>349931990</id>
  <parentid>225434394</parentid>
  <timestamp>2010-03-15T02:55:02Z</timestamp>
  <contributor>
    <ip>143.105.193.119</ip>
  </contributor>
  <comment>/* Sources */</comment>
  <sha1>2zkdnl9nsd1fbopv0fpwu2j5gdf0haw</sha1>
  <text xml:space="preserve" bytes="1436">'''Aratrum''' is the Latin word for  [[plough]], and &quot;arotron&quot; (αροτρον) is the [[Greek language|Greek]] word. The   [[Ancient Greece|Greeks]] appear to have had diverse kinds of plough from the earliest  historical records. [[Hesiod]] advised the farmer to have always two ploughs, so that if  one broke the other might be ready for use. These ploughs should be of two kinds, the one  called &quot;autoguos&quot; (αυτογυος, &quot;self-limbed&quot;), in which the plough-tail  was of the same piece of timber as the share-beam and the pole; and the other called  &quot;pekton&quot; (πηκτον, &quot;fixed&quot;), because in it, three parts, which were of  three kinds of timber, were adjusted to one another, and fastened together by nails.

The ''autoguos'' plough was made from a [[sapling]] with two branches growing from its   trunk in opposite directions. In ploughing, the trunk served as the pole, one of the two     branches stood upwards and became the tail, and the other penetrated the ground and,    sometimes shod with bronze or iron, acted as the [[ploughshare]]. 

==Sources==
Based on an article from ''A Dictionary of Greek and Roman Antiquities,'' John Murray,     London, 1875.
ἄρατρον

==External links==
*[http://penelope.uchicago.edu/Thayer/E/Roman/Texts/secondary/SMIGRA*/Aratrum.html Smith's     Dictionary article], with diagrams, further details, sources.
[[Category:Agricultural machinery]]
[[Category:Ancient Greece]]
[[Category:Animal equipment]]</text>
</revision>
</page>

我也试过 iterparse 然后打印它找到的元素的标签:

for e in etree.iterparse(file_name):
    print e.tag

但它抱怨 e 没有标签属性。

编辑: 截屏

4

2 回答 2

31

问题是您没有考虑 XML 名称空间。XML 文档(以及其中的所有元素)位于http://www.mediawiki.org/xml/export-0.7/名称空间中。为了让它发挥作用,你需要改变

titles = document.findall('.//title')

titles = document.findall('.//{http://www.mediawiki.org/xml/export-0.7/}title')

命名空间也可以通过namespaces参数提供:

NSMAP = {'mw':'http://www.mediawiki.org/xml/export-0.7/'}
titles = document.findall('.//mw:title', namespaces=NSMAP)

这在 Python 2.7 中有效,但在Python 2.7 文档中没有解释(Python 3.3 文档更好)。

另请参阅http://effbot.org/zone/element-namespaces.htm和这个带有答案的 SO 问题:Parsing XML with namespace in Python via 'ElementTree'


问题iterparse()在于这个函数提供了(event, element)元组(不仅仅是元素)。为了获取标签名称,更改

for e in etree.iterparse(file_name):
    print e.tag

对此:

for e in etree.iterparse(file_name):
    print e[1].tag
于 2013-12-07T22:23:55.800 回答
0

首先,您需要找到父元素page. 我不知道这个嵌套了多少层,但是一旦找到它,您就可以立即获得title标签:

>>> page_tag = ET.fromstring(xdata)
>>> title_tag = page_tag.find('title')
>>> title_tag.text
'Aratrum'

随着更多信息的涌入,您可以这样做:

def parser(file_name):
    document = etree.parse(file_name)
    titles = []
    for page_tag in document.findall('page'):
        titles.append(page_tag.find('title').text)
    return titles

希望这可以帮助!

于 2013-12-06T23:56:23.433 回答