python - Iterating through XML to find URLs with a specific extension using Python

Question

I have an XML file that I download from a URL. I then want to iterate through the XML to find links to files with a specific file extension.

My XML looks something like this:

<Foo>
    <bar>
        <file url="http://foo.txt"/>
        <file url="http://bar.doc"/>
    </bar>
</Foo>

I have written code to fetch the XML file like this:

import urllib2, re
from xml.dom.minidom import parseString

file = urllib2.urlopen('http://foobar.xml')
data = file.read()
file.close()
dom = parseString(data)
xmlTag = dom.getElementsByTagName('file')

Then I would "like" something like this to work:

i = 0
url = ''
while i < len(xmlTag):
    if re.search('*.txt', xmlTag[i].toxml()) is not None:
        url = xmlTag[i].toxml()
    i = i + 1

** Some code that parses out the url **

But this throws an error. Does anyone have tips on a better way to do this?

Thanks!

score 4 · Accepted Answer

Frankly, your last bit of code is gross. dom.getElementsByTagName('file') gives you a list of all the <file> elements in the tree... just iterate over it.

urls = []
for file_node in dom.getElementsByTagName('file'):
    url = file_node.getAttribute('url')
    if url.endswith('.txt'):
        urls.append(url)
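
For anyone who wants to run this without the download step, here is a minimal self-contained sketch of the same loop. It assumes Python 3 and uses the sample XML from the question inline; the helper name urls_with_extension is made up for illustration.

```python
from xml.dom.minidom import parseString

# Sample document copied from the question.
data = """
<Foo>
    <bar>
        <file url="http://foo.txt"/>
        <file url="http://bar.doc"/>
    </bar>
</Foo>
"""


def urls_with_extension(xml_text, extension):
    """Return the url attribute of every <file> element that ends with extension."""
    dom = parseString(xml_text)
    return [
        node.getAttribute('url')
        for node in dom.getElementsByTagName('file')
        if node.getAttribute('url').endswith(extension)
    ]


print(urls_with_extension(data, '.txt'))  # prints ['http://foo.txt']
```

The plain string check with endswith is enough here, so no regular expression is needed.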

By the way, you should never have to do indexing by hand in Python. Even in the rare case where you need the index number, just use enumerate:

mylist = ['a', 'b', 'c']
for i, value in enumerate(mylist):
    print i, value
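
As for the error itself: it most likely comes from the pattern rather than the loop, since '*.txt' is a shell-style glob and not a valid regular expression. The sketch below (Python 3 assumed) shows the failure and two ways to express the same match; the variable names are only for illustration.

```python
import re
from fnmatch import fnmatchcase

# A leading '*' has nothing to repeat, so compiling the glob as a
# regular expression raises re.error.
try:
    re.compile('*.txt')
    glob_is_valid_regex = True
except re.error:
    glob_is_valid_regex = False

url = 'http://foo.txt'

# Regex equivalent: escape the dot and anchor at the end of the string.
regex_match = re.search(r'\.txt$', url) is not None

# fnmatchcase understands glob patterns such as '*.txt' directly.
glob_match = fnmatchcase(url, '*.txt')

print(glob_is_valid_regex, regex_match, glob_match)  # prints False True True
```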

score 3 · Accepted Answer

An example using lxml, urlparse and os.path:

from lxml import etree
from urlparse import urlparse
from os.path import splitext

data = """
<Foo>
    <bar>
        <file url="http://foo.txt"/>
        <file url="http://bar.doc"/>
    </bar>
</Foo>
"""

tree = etree.fromstring(data).getroottree()
for url in tree.xpath('//Foo/bar/file/@url'):
    spliturl = urlparse(url)
    name, ext = splitext(spliturl.netloc)
    print url, 'is a', ext, 'file'
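
Note that splitting netloc only works because the sample URLs have no path, so the "file name" ends up in the host part. For a URL with a path, the extension lives in the path instead. Here is a rough sketch of a variant that handles both, assuming Python 3 and only the standard library (xml.etree.ElementTree in place of lxml); the third URL and the url_extension helper are hypothetical additions for illustration.

```python
import xml.etree.ElementTree as ET
from os.path import splitext
from urllib.parse import urlparse

# Question's sample plus one hypothetical URL that has a real path.
data = """
<Foo>
    <bar>
        <file url="http://foo.txt"/>
        <file url="http://bar.doc"/>
        <file url="http://example.com/docs/report.txt?download=1"/>
    </bar>
</Foo>
"""


def url_extension(url):
    """Return the extension of the URL's path, or of its host if there is no path."""
    parts = urlparse(url)
    # Prefer the path; fall back to the host for path-less URLs such as
    # the question's 'http://foo.txt'.
    return splitext(parts.path or parts.netloc)[1]


root = ET.fromstring(data)
for file_node in root.iter('file'):
    url = file_node.get('url')
    print(url, 'is a', url_extension(url), 'file')
```

Because urlparse separates the query string from the path, the trailing ?download=1 does not end up in the extension.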
