python - Read a tag from XML and print it

Question

I have the following input XML file. I read the rel_notes tag and print it, but I run into the error below.

Input XML:

<rel_notes>
    •   Please move to this build for all further test and development activities 
    •   Please use this as base build to verify compilation and sanity before any check-in happens

</rel_notes>

Sample Python code:

file = open('data.xml', 'r')
from xml.etree import cElementTree as etree
tree = etree.parse(file)
print('\n'.join(elem.text for elem in tree.iter('rel_notes')))

Output

   print('\n'.join(elem.text for elem in tree.iter('rel_notes')))
 File "C:\python2.7.3\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2022' in position 9: character maps to <undefined>

score 1 · Accepted Answer

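The problem is printing Unicode to the Windows console. That is, the character "•" cannot be represented in cp437, the encoding your console uses.

To reproduce the problem, try: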

print u'\u2022'
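The same failure can be checked without a Windows console by encoding explicitly. This is a minimal sketch, assuming Python 3 (the snippets in this answer are Python 2); `can_encode` is a hypothetical helper name, not a library function.

```python
def can_encode(text, encoding):
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


bullet = '\u2022'  # the bullet character from the rel_notes text
print(can_encode(bullet, 'cp437'))  # False
print(can_encode(bullet, 'utf-8'))  # True
```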

You can set the PYTHONIOENCODING environment variable to tell Python to replace every unrepresentable character with the corresponding XML character reference:

T:\> set PYTHONIOENCODING=cp437:xmlcharrefreplace
T:\> python your_script.py

Or encode the text to bytes before printing:

print u'\u2022'.encode('cp437', 'xmlcharrefreplace')
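To see what the `xmlcharrefreplace` error handler produces, here is a small sketch assuming Python 3. The helper name `to_console_safe` is made up for illustration; it encodes with the handler and decodes back so the result is still a string.

```python
def to_console_safe(text, encoding='cp437'):
    """Replace characters the target encoding lacks with XML character references."""
    return text.encode(encoding, 'xmlcharrefreplace').decode(encoding)


print(to_console_safe('\u2022 Please move to this build'))
# prints: &#8226; Please move to this build
```

If the numeric reference is too noisy, the built-in `'replace'` handler substitutes `?` instead.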

To answer your initial question:

To print the text of each <build_location/> element:

import sys
from xml.etree import cElementTree as etree

input_file = sys.stdin # filename or file object
tree = etree.parse(input_file)
print('\n'.join(elem.text for elem in tree.iter('build_location')))
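One caveat with joining `elem.text` directly: an empty element has `text` set to `None`, which makes `'\n'.join(...)` raise a `TypeError`. The sketch below, assuming Python 3 and `xml.etree.ElementTree`, skips such elements. The sample document and the helper name `element_texts` are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET


def element_texts(source, tag):
    """Return the text of every <tag> element, skipping elements with no text."""
    tree = ET.parse(source)
    return [elem.text for elem in tree.iter(tag) if elem.text is not None]


# Hypothetical sample input; a real script would pass a filename or file object.
sample = io.BytesIO(
    b"<build>"
    b"<build_location>/builds/a</build_location>"
    b"<build_location/>"
    b"<build_location>/builds/b</build_location>"
    b"</build>"
)
print('\n'.join(element_texts(sample, 'build_location')))
```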

If the input file is large, iterparse() can be used:

import sys
from xml.etree import cElementTree as etree

input_file = sys.stdin
context = iter(etree.iterparse(input_file, events=('start', 'end')))
_, root = next(context) # get root element
for event, elem in context:
    if event == 'end' and elem.tag == 'build_location':
        print(elem.text)
        root.clear() # free memory
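The same incremental approach can be wrapped in a generator so it is easy to try on in-memory data. This is a sketch assuming Python 3; `iter_texts` and the sample bytes are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_texts(source, tag):
    """Yield the text of each <tag> element while parsing incrementally."""
    context = iter(ET.iterparse(source, events=('start', 'end')))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem.text
            root.clear()  # drop already-processed children to free memory


# Hypothetical sample input standing in for a large file.
sample = io.BytesIO(b"<r><x>1</x><y>skip</y><x>2</x></r>")
for text in iter_texts(sample, 'x'):
    print(text)
```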

score 0 · Accepted Answer

I don't think the whole snippet above is entirely helpful. However, a UnicodeEncodeError usually occurs when non-ASCII characters are not handled properly.

Example:

unicode_str = html.decode(<source encoding>)

encoded_str = unicode_str.encode("utf8")

This is explained clearly in this answer: Python: Convert Unicode to ASCII without errors

This should at least resolve the UnicodeEncodeError.
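As a concrete illustration of the decode-then-encode pattern, here is a sketch assuming Python 3 and assuming the source bytes are UTF-8 (the real source encoding has to be known). The `'ignore'` handler silently drops anything ASCII cannot represent, so use it only when losing those characters is acceptable.

```python
raw = b'\xe2\x80\xa2 Please move to this build'  # UTF-8 bytes starting with a bullet

text = raw.decode('utf-8')                   # bytes -> str
utf8_bytes = text.encode('utf8')             # str -> bytes, lossless round trip
ascii_only = text.encode('ascii', 'ignore')  # the bullet is dropped

print(utf8_bytes == raw)  # True
print(ascii_only)         # b' Please move to this build'
```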
