python - Iterating an XML file and extracting data from it

Question

I have this XML file:

<movie id = 0> 
  <Movie_name>The Shawshank Redemption   </Movie_name> 
  <Address>http://www.imdb.com/title/tt0111161/
  </Address> 
  <year>1994  </year> 
  <stars>Tim Robbins  Morgan Freeman  Bob Gunton    </stars> 
  <plot> plot...
  </plot> 
  <keywords>Reviews, Showtimes</keywords>
</movie>

<movie id = 1> 
  <Movie_name>Inglourious Basterds   </Movie_name> 
  <Address>http://www.imdb.com/title/tt0361748/
  </Address> 
  <year>2009  </year> 
  <stars>Brad Pitt  M&#xE9;lanie Laurent  Christoph Waltz    </stars> 
  <plot>plot/... 
  </plot> 
  <keywords>Reviews, credits  </keywords>
</movie>

How can I iterate over the file, extracting the specific data for each movie? I mean, for movie 0: its name, address, year and so on.

The input file structure is fixed, so the data extraction has to be done while looping.

Much thanks.

score 3 · Accepted Answer

You will want to check out xml.etree.ElementTree.

I'd also note that what you have there is not valid XML, so you might run into issues. Valid XML would probably look more like this:

<movie id="0"> 
  <name>The Shawshank Redemption</name> 
  <url>http://www.imdb.com/title/tt0111161/</url> 
  <year>1994</year> 
  <stars>
    <star>Tim Robbins</star>
    <star>Morgan Freeman</star>
    <star>Bob Gunton</star>
  </stars> 
  <plot>plot...</plot> 
  <keywords>
    <keyword>Reviews</keyword>
    <keyword>Showtimes</keyword>
  </keywords>
</movie>

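Note the lowercase tag names and the quoted attribute value (<movieNum = 0> doesn't make sense, and an attribute value such as id = 0 must be quoted). You will also want an XML declaration (like <?xml version="1.0" encoding="UTF-8" ?>) at the top, and a single root element around all of the <movie> elements. You can validate your XML at XML Validation, or using xmllint, for example.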
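If no external validator is at hand, a rough check can be done from Python itself. The sketch below only tests well-formedness (it does not validate against a schema), and the helper name `is_well_formed` is made up for illustration:

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return False, str(exc)
    return True, None


# Unquoted attribute value, as in the question: rejected by the parser.
bad = '<movie id = 0><year>1994</year></movie>'
# Quoted attribute value: accepted.
good = '<movie id="0"><year>1994</year></movie>'

print(is_well_formed(bad))
print(is_well_formed(good))
```

The error message returned for the first string includes the line and column where the parser gave up, which is usually enough to find the offending tag.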

Once you have valid XML, you can parse it and iterate over it using iterparse(), or parse it and then iterate over the constructed element tree.
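As a minimal sketch of the iterparse() route, assuming the corrected layout above and an assumed <movies> root element (the sample data and the `iter_movies` helper are illustrative, not part of the original file):

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data in the corrected layout, wrapped in an assumed
# <movies> root element.
XML_DATA = b"""<?xml version="1.0" encoding="UTF-8"?>
<movies>
  <movie id="0">
    <name>The Shawshank Redemption</name>
    <year>1994</year>
    <stars>
      <star>Tim Robbins</star>
      <star>Morgan Freeman</star>
    </stars>
  </movie>
  <movie id="1">
    <name>Inglourious Basterds</name>
    <year>2009</year>
    <stars>
      <star>Brad Pitt</star>
      <star>Christoph Waltz</star>
    </stars>
  </movie>
</movies>
"""


def iter_movies(source):
    """Yield one dict per <movie> element once it has been read completely."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "movie":
            continue
        yield {
            "id": elem.get("id"),
            "name": elem.findtext("name"),
            "year": elem.findtext("year"),
            "stars": [star.text for star in elem.iter("star")],
        }
        elem.clear()  # drop the children of the element we just processed


movies = list(iter_movies(io.BytesIO(XML_DATA)))
for movie in movies:
    print(movie["id"], movie["name"], movie["year"], ", ".join(movie["stars"]))
```

For a real file, a filename can be passed to ET.iterparse() instead of the BytesIO object. For small files, ET.parse(...).getroot() followed by a plain loop over root.iter("movie") is simpler and works just as well.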

score 3 · Accepted Answer

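Edit: using the improved XML input

I strongly recommend that you try validating your input, as suggested in @Lattyware's comment. I have found that BeautifulSoup does a good job of recovering something usable from invalid XML and HTML. Here is what a quick attempt does: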

from BeautifulSoup import BeautifulSoup

# Note: I have added the <movielist> root element
xml = """<movielist>
<movie id = 0> 
  <Movie_name>The Shawshank Redemption   </Movie_name> 
  <Address>http://www.imdb.com/title/tt0111161/
  </Address> 
  <year>1994  </year> 
  <stars>Tim Robbins  Morgan Freeman  Bob Gunton    </stars> 
  <plot> plot...
  </plot> 
  <keywords>Reviews, Showtimes</keywords>
</movie>

<movie id = 1> 
  <Movie_name>Inglourious Basterds   </Movie_name> 
  <Address>http://www.imdb.com/title/tt0361748/
  </Address> 
  <year>2009  </year> 
  <stars>Brad Pitt  M&#xE9;lanie Laurent  Christoph Waltz    </stars> 
  <plot>plot/... 
  </plot> 
  <keywords>Reviews, credits  </keywords>
</movie>

</movielist>"""

soup = BeautifulSoup(xml)
movies = soup.findAll('movie')

for movie in movies:
    id_tag = movie['id']
    name = movie.find("movie_name").text
    url = movie.find("address").text
    year = movie.find("year").text
    stars = movie.find("stars").text
    plot = movie.find("plot").text
    keywords = movie.find("keywords").text
    for item in (id_tag, name, url, year, stars, plot, keywords):
        print(item)
    print('=' * 50)

This outputs the following (the ID tag is now accessible):

0
The Shawshank Redemption
http://www.imdb.com/title/tt0111161/
1994
Tim Robbins  Morgan Freeman  Bob Gunton
plot...
Reviews, Showtimes
==================================================
1
Inglourious Basterds
http://www.imdb.com/title/tt0361748/
2009
Brad Pitt  M&#xE9;lanie Laurent  Christoph Waltz
plot/...
Reviews, credits
==================================================

Hope this gives you a start... it can only get better from here.
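One possible next step is to turn the extracted strings into lists. This is only a sketch under the assumption that, as in the sample data, star names are separated by two or more spaces and keywords by commas; the helper names are made up:

```python
import re


def split_stars(text):
    """Split a run of names separated by two or more whitespace characters."""
    return [name for name in re.split(r"\s{2,}", text.strip()) if name]


def split_keywords(text):
    """Split a comma-separated keyword string and trim each entry."""
    return [word.strip() for word in text.split(",") if word.strip()]


print(split_stars("Tim Robbins  Morgan Freeman  Bob Gunton    "))
print(split_keywords("Reviews, credits  "))
```

If a name in the real data ever contains a double space, or a keyword contains a comma, these rules would split it incorrectly, so it is worth checking them against the full file.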

score 2 · Accepted Answer

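BeautifulSoup is more forgiving, and it can also be used for HTML (where some closing tags are optional). ElementTree can be used only when the XML is valid. You can make the fragment partly valid by wrapping it in a single element. Attribute values must also be enclosed in quotes. Try the following, where the Movie class was created to capture the information from one movie element. The class is derived from dict and is as flexible as a dict; however, you can add your own methods that return processed values from the collected information: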

# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

class Movie(dict):

    def __init__(self, movie_element):
        assert movie_element.tag == 'movie'  # we are able to process only that
        self['id'] = movie_element.attrib['id']  
        for e in movie_element:
            self[e.tag] = (e.text or '').strip()  # e.text is None for empty elements

    def name(self):
        return self['Movie_name']

    def url(self):
        return self['Address']

    def year(self):
        return self['year']     

    def stars(self):
        return self['stars']

    def plot(self):
        return self['plot']

    def keywords(self):
        return self['keywords']

    def __str__(self):
        lst = []
        lst.append(self.name() + ' (' + self.year() + ')')
        lst.append(self.stars())
        lst.append(self.url())
        return '\n'.join(lst)


fragment = '''\
<movie id = "0"> 
  <Movie_name>The Shawshank Redemption   </Movie_name> 
  <Address>http://www.imdb.com/title/tt0111161/
  </Address> 
  <year>1994  </year> 
  <stars>Tim Robbins  Morgan Freeman  Bob Gunton    </stars> 
  <plot> plot...
  </plot> 
  <keywords>Reviews, Showtimes</keywords>
</movie>

<movie id = "1"> 
  <Movie_name>Inglourious Basterds   </Movie_name> 
  <Address>http://www.imdb.com/title/tt0361748/
  </Address> 
  <year>2009  </year> 
  <stars>Brad Pitt  Melanie Laurent  Christoph Waltz    </stars> 
  <plot>plot/... 
  </plot> 
  <keywords>Reviews, credits  </keywords>
</movie>
'''

fixed_fragment = '<root>\n' + fragment + '</root>'
##print fixed_fragment

tree = ET.fromstring(fixed_fragment)
movies = []
for m in tree:
    movies.append(Movie(m))

for movie in movies:
    print('\n------------------')
    print(movie)

It prints the following on my console:

------------------
The Shawshank Redemption (1994)
Tim Robbins  Morgan Freeman  Bob Gunton
http://www.imdb.com/title/tt0111161/

------------------
Inglourious Basterds (2009)
Brad Pitt  Melanie Laurent  Christoph Waltz
http://www.imdb.com/title/tt0361748/

Note that I have replaced the non-ASCII characters; encoding problems have to be solved separately.
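For what it's worth, the &#xE9; in the question is a numeric character reference, which an XML parser decodes by itself. The following is a rough sketch of getting from the question's original input to something ElementTree accepts; it assumes the only defect is an unquoted numeric id on <movie>, and the regular expression is an illustrative guess rather than a general repair tool:

```python
import re
import xml.etree.ElementTree as ET

# One movie from the question, with the unquoted id and the &#xE9; reference.
raw = """<movie id = 1>
  <Movie_name>Inglourious Basterds   </Movie_name>
  <stars>Brad Pitt  M&#xE9;lanie Laurent  Christoph Waltz    </stars>
</movie>
"""

# Assumption: the only defect is an unquoted numeric id on <movie>.
quoted = re.sub(r"<movie\s+id\s*=\s*(\d+)\s*>", r'<movie id="\1">', raw)

# Wrap the fragment in a single root element, as described above.
root = ET.fromstring("<root>\n" + quoted + "</root>")
movie = root.find("movie")
stars = movie.findtext("stars").strip()

print(movie.get("id"))
print(stars)
```

In Python 3 the parsed text is a str containing the decoded character, so printing it depends only on the console being able to display it.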
