python - Python: extracting information from XML into a dictionary

Question

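I need to extract information from an XML file, isolate it from the XML tags before and after it, store the information in a dictionary, and then loop through the dictionary to print a list. I am an absolute beginner, so I'd like to keep it as simple as possible, and I apologize if my description of what I want to do doesn't make much sense.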

This is what I have so far.

for line in open("/people.xml"):
    if "name" in line:
        print(line)
    if "age" in line:
        print(line)

Current output:

     <name>John</name>

  <age>14</age>

    <name>Kevin</name>

  <age>10</age>

    <name>Billy</name>

  <age>12</age>

Desired output:

Name          Age
John          14
Kevin         10
Billy         12

Edit: using the code below I can get this output:

{'Billy': '12', 'John': '14', 'Kevin': '10'}

Does anyone know how to get from this to a table with the headings from my desired output?
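One way to get from that dictionary to the table is to pad each name to a fixed column width. The following is only a minimal sketch: the helper name format_table and the width of 14 characters are assumptions, chosen to match the spacing in the desired output above.

```python
people = {'Billy': '12', 'John': '14', 'Kevin': '10'}


def format_table(people, headers=("Name", "Age"), width=14):
    """Return the table as a list of lines, header first."""
    # ljust pads the first column with spaces up to the given width.
    lines = [headers[0].ljust(width) + headers[1]]
    for name, age in people.items():
        lines.append(name.ljust(width) + age)
    return lines


for line in format_table(people):
    print(line)
```

The rows follow the dictionary's iteration order, so they only match the file order if the dictionary was filled in that order.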

score 3 · Accepted Answer

Try xmldict (converts XML to a Python dictionary and vice versa):

>>> xmldict.xml_to_dict('''
... <root>
...   <persons>
...     <person>
...       <name first="foo" last="bar" />
...     </person>
...     <person>
...       <name first="baz" last="bar" />
...     </person>
...   </persons>
... </root>
... ''')
{'root': {'persons': {'person': [{'name': {'last': 'bar', 'first': 'foo'}}, {'name': {'last': 'bar', 'first': 'baz'}}]}}}


# Converting dictionary to xml 
>>> xmldict.dict_to_xml({'root': {'persons': {'person': [{'name': {'last': 'bar', 'first': 'foo'}}, {'name': {'last': 'bar', 'first': 'baz'}}]}}})
'<root><persons><person><name><last>bar</last><first>foo</first></name></person><person><name><last>bar</last><first>baz</first></name></person></persons></root>'

Or try xmlmapper (a list of Python dictionaries with parent-child relationships):

  >>> myxml='''<?xml version='1.0' encoding='us-ascii'?>
          <slideshow title="Sample Slide Show" date="2012-12-31" author="Yours Truly" >
          <slide type="all">
              <title>Overview</title>
              <item>Why
                  <em>WonderWidgets</em>
                     are great
                  </item>
                  <item/>
                  <item>Who
                  <em>buys</em>
                  WonderWidgets1
              </item>
          </slide>
          </slideshow>'''
  >>> x=xml_to_dict(myxml)
  >>> for s in x:
          print(s)
  >>>
  {'text': '', 'tail': None, 'tag': 'slideshow', 'xmlinfo': {'ownid': 1, 'parentid': 0}, 'xmlattb': {'date': '2012-12-31', 'author': 'Yours Truly', 'title': 'Sample Slide Show'}}
  {'text': '', 'tail': '', 'tag': 'slide', 'xmlinfo': {'ownid': 2, 'parentid': 1}, 'xmlattb': {'type': 'all'}}
  {'text': 'Overview', 'tail': '', 'tag': 'title', 'xmlinfo': {'ownid': 3, 'parentid': 2}, 'xmlattb': {}}
  {'text': 'Why', 'tail': '', 'tag': 'item', 'xmlinfo': {'ownid': 4, 'parentid': 2}, 'xmlattb': {}}
  {'text': 'WonderWidgets', 'tail': 'are great', 'tag': 'em', 'xmlinfo': {'ownid': 5, 'parentid': 4}, 'xmlattb': {}}
  {'text': None, 'tail': '', 'tag': 'item', 'xmlinfo': {'ownid': 6, 'parentid': 2}, 'xmlattb': {}}
  {'text': 'Who', 'tail': '', 'tag': 'item', 'xmlinfo': {'ownid': 7, 'parentid': 2}, 'xmlattb': {}}
  {'text': 'buys', 'tail': 'WonderWidgets1', 'tag': 'em', 'xmlinfo': {'ownid': 8, 'parentid': 7}, 'xmlattb': {}}

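The code above returns a generator. When you iterate over it, each item is a dict whose keys carry the information: tag, text, xmlattb, and tail, plus additional information in xmlinfo. The root element has a parentid of 0.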
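For comparison, a similar flat list of records can be sketched with the standard library alone. This is not xmlmapper's implementation. The function name flatten and the sample XML are assumptions made for illustration, and empty text is normalised to '' rather than None.

```python
import xml.etree.ElementTree as ET


def flatten(xml_text):
    """Yield one dict per element, in document order, with own/parent ids."""
    root = ET.fromstring(xml_text)
    counter = 0
    # Stack of (element, parent_id); the root's parent id is 0.
    stack = [(root, 0)]
    while stack:
        element, parent_id = stack.pop()
        counter += 1
        own_id = counter
        yield {
            "tag": element.tag,
            "text": (element.text or "").strip(),
            "tail": (element.tail or "").strip(),
            "xmlattb": dict(element.attrib),
            "xmlinfo": {"ownid": own_id, "parentid": parent_id},
        }
        # Push children in reverse so they are popped in document order.
        for child in reversed(list(element)):
            stack.append((child, own_id))


# Hypothetical sample; the real people.xml is not shown in the question.
sample = "<people><person><name>John</name><age>14</age></person></people>"
records = list(flatten(sample))
for record in records:
    print(record["xmlinfo"], record["tag"], record["text"])
```

Because every record carries its parentid, the name and age records that belong to the same person can be matched up afterwards.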

score 1 · Accepted Answer

Use an XML parser for this. For example:

import xml.etree.ElementTree as ET
doc = ET.parse('people.xml')
names = [name.text for name in doc.findall('.//name')]
ages = [age.text for age in doc.findall('.//age')]
people = dict(zip(names,ages))
print(people)
# {'Billy': '12', 'John': '14', 'Kevin': '10'}
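Note that zip pairs names and ages purely by position, so a person with a missing age would shift every later pair. If each person has a wrapper element, reading name and age per person avoids that. The sketch below assumes a person element around each pair; the question does not show the real structure of people.xml, so the sample is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the <people>/<person> layout is an assumption.
SAMPLE = """
<people>
  <person><name>John</name><age>14</age></person>
  <person><name>Kevin</name><age>10</age></person>
  <person><name>Billy</name><age>12</age></person>
</people>
"""

root = ET.fromstring(SAMPLE)
# findtext returns the text of the first matching child, or None if absent.
people = [(p.findtext("name"), p.findtext("age")) for p in root.iter("person")]
print(people)
```

The result is a list of (name, age) tuples in file order; dict(people) turns it into the dictionary form if that is what you need.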

score 0 · Accepted Answer

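In my view this is an exercise in learning to parse XML by hand rather than pulling a library off the shelf to do it for you. If I'm wrong, I recommend watching Steve Huffman's Udacity video, which can be found here: http://www.udacity.com/view#Course/cs253/CourseRev/apr2012/Unit/362001/Nugget/365002 . He explains how to use the minidom module to parse lightweight XML files such as this one.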

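Now, the first point I want to make in my answer is that you don't want to create a Python dictionary just to print all these values. A Python dictionary is simply a set of keys mapped to values. Dictionaries historically had no guaranteed order (insertion order is only guaranteed from Python 3.7 onward), so traversing them in the order the entries appear in the file can be a pain. You are trying to print every name with its corresponding age, so a data structure such as a list of tuples is probably better suited to organizing your data.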

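It looks like your XML file is structured so that every name tag is followed by a corresponding age tag. There also appears to be only one name tag per line. That makes things fairly simple. I'm not going to write the most efficient or most general solution to this problem; instead I'll try to keep the code as simple and easy to follow as possible.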

So let's first create a list to store the data:

a_list = []

Now open your file and initialize a couple of variables to hold each name and age:

from __future__ import with_statement  # only needed on Python 2.5; harmless on later versions

with open("/people.xml") as f:
    name, age = None, None  # hold the current person's name and age during the traversal
    for line in f:
        name = extract_name(line, name)  # defined below; in a script, place the definition above this loop
        age = extract_age(line)  # defined below as well
        if age:  # an age means the current person is complete
            a_list.append((name, age))  # store the person
            name, age = None, None  # reset before reading the next person

Now, for each line in the file, we want to determine whether it contains a user. If it does, we want to extract the name. Let's create a function for that:

def extract_name(a_line, name):
    # a_line is the current line; name is the value carried over from the traversal.
    if name:  # a name is already pending, so keep it until its age is found
        return name
    if "<name>" not in a_line:  # no name on this line
        return None
    name_pos = a_line.find("<name>") + 6  # 6 == len("<name>")
    end_pos = a_line.find("</name>")
    return a_line[name_pos:end_pos]

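Next we need a function that parses a line for the user's age. We can do this in much the same way as the previous function, but we know that once we have an age it is added to the list immediately. So we never need to care about the previous value of age. The function can therefore look like this: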

def extract_age(a_line):
    if "<age>" not in a_line:  # no age on this line
        return None
    age_pos = a_line.find("<age>") + 5  # 5 == len("<age>")
    end_pos = a_line.find("</age>")
    return a_line[age_pos:end_pos]
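The two functions differ only in the tag they look for, so the same find-and-slice idea can be written once. The sketch below is one possible generalisation, not part of the answer's code: the helper name extract_tag is hypothetical, and the sample lines stand in for the file so the loop can be run without people.xml.

```python
def extract_tag(a_line, tag):
    """Return the text between <tag> and </tag> on this line, or None."""
    open_tag = "<%s>" % tag
    close_tag = "</%s>" % tag
    start = a_line.find(open_tag)
    end = a_line.find(close_tag)
    if start == -1 or end == -1:  # tag not (fully) present on this line
        return None
    return a_line[start + len(open_tag):end]


# Hypothetical lines standing in for the contents of people.xml.
lines = ["<person>", "    <name>John</name>", "  <age>14</age>", "</person>"]

a_list = []
name = None
for line in lines:
    # Keep a pending name until the matching age shows up.
    name = name or extract_tag(line, "name")
    age = extract_tag(line, "age")
    if age:
        a_list.append((name, age))
        name = None

print(a_list)
```

Using len(open_tag) instead of the hard-coded 6 and 5 keeps the offsets correct for any tag name.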

Finally, you want to print the list. You can do that like this:

for item in a_list:
    print('\t'.join(item))

Hope this helps. I haven't tested my code, so it may still be a bit buggy. The concepts are there, though. :)

score 0 · Accepted Answer

Here is another way, using the lxml library:

from lxml import objectify


def xml_to_dict(xml_str):
    """ Convert xml to dict, using lxml v3.4.2 xml processing library, see http://lxml.de/ """
    def xml_to_dict_recursion(xml_object):
        dict_object = xml_object.__dict__
        if not dict_object:  # if empty dict returned
            return xml_object
        for key, value in dict_object.items():
            dict_object[key] = xml_to_dict_recursion(value)
        return dict_object
    return xml_to_dict_recursion(objectify.fromstring(xml_str))

xml_string = """<?xml version="1.0" encoding="UTF-8"?><Response><NewOrderResp>
<IndustryType>Test</IndustryType><SomeData><SomeNestedData1>1234</SomeNestedData1>
<SomeNestedData2>3455</SomeNestedData2></SomeData></NewOrderResp></Response>"""

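# On Python 3, lxml rejects a str that carries an encoding declaration, so pass bytes.
print(xml_to_dict(xml_string.encode("utf-8")))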

To keep the parent node, use this instead:

def xml_to_dict(xml_str):
    """ Convert xml to dict, using lxml v3.4.2 xml processing library, see http://lxml.de/ """
    def xml_to_dict_recursion(xml_object):
        dict_object = xml_object.__dict__
        if not dict_object:  # if empty dict returned
            return xml_object
        for key, value in dict_object.items():
            dict_object[key] = xml_to_dict_recursion(value)
        return dict_object
    xml_obj = objectify.fromstring(xml_str)
    return {xml_obj.tag: xml_to_dict_recursion(xml_obj)}

If you only want to return a subtree and convert it to a dict, you can use Element.find():

xml_obj.find('.//')  # lxml.objectify.ObjectifiedElement instance

See the lxml documentation.
