python - xml.etree writes XML to a file in unexpected ways

Question

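I am using xml.etree.ElementTree to parse and modify a UTF-8 XML file. Two of my problems arise because the file is written in Unix file format rather than Windows format. Problem 1 is obvious: the line endings are \n instead of \r\n. Problem 2 is that, because of the different file format (I assume), UTF-8 strings are rendered differently. How can I force write() to save the file in Windows format? I currently call write() as follows: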

    # -*- coding: utf-8 -*-
    import xml.etree.ElementTree as ET
    import sys

    altSpellingTree = ET.parse(sys.argv[2])
    altSpellingRoot = altSpellingTree.getroot()
    recordList = altSpellingRoot.findall("record") # Grab all <record> elements and iterate
    for record in recordList:
        # Check for the existence of an <alternative_spelling> element
        alt_spelling_node = record.find("person").find("names").find("alternative_spelling")
        if alt_spelling_node is None:
            continue
        else:
            # Check if <alternative_spelling> element text is solely ","
            if alt_spelling_node.text == ",":
                alt_spelling_node.text = None # Remove the lone comma
    altSpellingTree.write(sys.argv[2], encoding="utf-8", xml_declaration=True)

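A third problem is that the output file uses self-closing tags where there used to be separate opening and closing tags (for example, <Country></Country> becomes <Country />). Is there a way to prevent this?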

-------EDIT--------
Here are 2 samples of the XML before the program runs:

    <Country></Country>
    <Category_Type></Category_Type>
    <Standard></Standard>

    <names>
      <first_name>Fernando</first_name>
      <last_name>ROMERO AVILA</last_name>
      <aliases>
        <alias xsi:nil="true" />
      </aliases>
      <low_quality_aliases>
        <alias xsi:nil="true" />
      </low_quality_aliases>
      <alternative_spelling>ROMERO ÁVILA,Fernando</alternative_spelling>
    </names>

The same 2 samples after the program runs:

    <Country />
    <Category_Type />
    <Standard />

    <names>
      <first_name>Fernando</first_name>
      <last_name>ROMERO AVILA</last_name>
      <aliases>
        <alias xsi:nil="true" />
      </aliases>
      <low_quality_aliases>
        <alias xsi:nil="true" />
      </low_quality_aliases>
      <alternative_spelling>ROMERO ÃVILA,Fernando</alternative_spelling>
    </names>

score 1 · Accepted Answer

I haven't tested your code, so apologies for any mistakes, but to avoid the self-closing tags, change this:

    altSpellingTree.write(sys.argv[2], encoding="utf-8", xml_declaration=True)

to

    altSpellingTree.write(sys.argv[2], encoding="utf-8", xml_declaration=True, method="html")

That should do the trick.
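Since method="html" applies HTML serialization rules to the whole document, another option may be worth trying on Python 3.4 or later: the keyword-only short_empty_elements argument keeps the XML serializer but writes empty elements as an open/close pair. The sketch below uses a made-up sample document based on the elements in the question, and ET.tostring so the result is easy to inspect.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, modelled on the elements in the question.
root = ET.fromstring("<record><Country></Country><Standard></Standard></record>")

# short_empty_elements is keyword-only and available since Python 3.4.
# False makes empty elements serialize as <tag></tag> instead of <tag />.
expanded = ET.tostring(root, encoding="unicode", short_empty_elements=False)

# Default behaviour, for comparison: empty elements collapse to <tag />.
collapsed = ET.tostring(root, encoding="unicode")

print(expanded)
print(collapsed)
```

ElementTree.write() accepts the same keyword, so the call in the question would become altSpellingTree.write(sys.argv[2], encoding="utf-8", xml_declaration=True, short_empty_elements=False). Note that the setting applies to every empty element, so tags that were self-closing in the input, such as <alias xsi:nil="true" />, would be expanded as well.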

To greatly simplify your code, you can use iter to search your tree, like this:

    import xml.etree.ElementTree as ET

    tree = ET.parse('your.xml')

    for el in tree.iter('alternative_spelling'):
        # check your el text or whatever
        if el.text == ",":
            el.text = ""
        print(el.text)
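For the line-ending part of the question, one possible approach on Python 3 is to serialize to a string and write it yourself through a text-mode file opened with newline="\r\n". This is only a sketch: write_with_crlf is a hypothetical helper name, the sample document is made up, and the XML declaration is written by hand. As for the ÃVILA output, the UTF-8 encoding of Á is the two bytes C3 81, and C3 displays as Ã in a single-byte Windows code page, so the file may already be correct UTF-8 that is being viewed with the wrong encoding.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def write_with_crlf(tree, path):
    """Write tree to path as UTF-8 text with Windows (CRLF) line endings."""
    # encoding="unicode" returns a str rather than bytes.
    text = ET.tostring(tree.getroot(), encoding="unicode")
    # newline="\r\n" makes the text layer translate each "\n" to "\r\n".
    with open(path, "w", encoding="utf-8", newline="\r\n") as fh:
        fh.write('<?xml version="1.0" encoding="utf-8"?>\n')
        fh.write(text)


# Made-up sample input; the parser hands line breaks back as "\n".
sample = (
    "<names>\n"
    "  <alternative_spelling>ROMERO ÁVILA,Fernando</alternative_spelling>\n"
    "</names>"
)
tree = ET.ElementTree(ET.fromstring(sample))

path = os.path.join(tempfile.mkdtemp(), "out.xml")
write_with_crlf(tree, path)

# Read the raw bytes back to inspect line endings and encoding.
with open(path, "rb") as fh:
    raw = fh.read()
print(raw)
```

Reading the file back in binary mode, as above, is a simple way to confirm what was actually written, independent of how an editor chooses to display it.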
