python - Unicode error when looping through tags and writing XML

Question

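I am trying to write out some XML that contains some special characters. I run into trouble when I loop through a list of tags to create several elements called tag.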

# -*- coding: utf-8 -*-
import sys
import xml.etree.ElementTree as xml

reload(sys)
sys.setdefaultencoding('utf-8')

Code snippet:

    check = (video['tags'].split(', '))
    x=len(check)
    y=x-1
    for i in xrange(0,y):
        tagger = xml.SubElement(doc, 'field', name="tag")
        s=check[i]
        tagger.text = s.encode('utf-8')

The problem appears when I try to write:

output = open(file_name,'w+')
tree = xml.ElementTree(add)
tree.write(output)
output.close()

I get the following error:

Traceback (most recent call last):
  File "xml_breakup3.py", line 108, in <module>
    tagger.text = s.encode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 0: invalid start byte

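When I run my code without this snippet, it writes the XML without any problem. If I set tagger.text to any ordinary string (e.g. '99'), it writes fine. If I make the loop run from 0 to 3, it works. Only when I try to iterate over the whole list do I get the UnicodeDecodeError.

When I try: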

    check = (video['tags'].split(', '))
    for ta in check:
        tagger = xml.SubElement(doc, 'field', name="tag")
        tagger.text = ta

I get:

    Traceback (most recent call last):
      File "xml_breakup3.py", line 172, in <module>
        tree.write(output)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 821, in write
        serialize(write, self._root, encoding, qnames, namespaces)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 940, in _serialize_xml
        _serialize_xml(write, e, encoding, qnames, None)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 940, in _serialize_xml
        _serialize_xml(write, e, encoding, qnames, None)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 938, in _serialize_xml
        write(_escape_cdata(text, encoding))
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1074, in _escape_cdata
        return text.encode(encoding, "xmlcharrefreplace")
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xba in position 0: invalid start byte

score 0 · Accepted Answer

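You may want to try removing the str from in front of the piece you are encoding. When you use str, you are converting what I assume is Unicode into a byte string, and then you are trying to encode that. If you leave it as Unicode and encode it directly, it should work: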

>>> s = u'\xba'
>>> print s
º
>>> s.encode('utf8')
'\xc2\xba'
>>> str(s).encode('utf8')

Traceback (most recent call last):
  File "<pyshell#30>", line 1, in <module>
    str(s).encode('utf8')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xba' in position 0: ordinal not in range(128)
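
One way to apply this advice is to keep every tag as text (unicode) until the tree is written, and let ElementTree perform the single encode step. The sketch below is written for Python 3 (it should also run on 2.7). The helpers `to_text` and `build_tag_xml`, the `add`/`doc` nesting, and the `latin-1` source encoding are all assumptions made for illustration, since the question does not say how `video['tags']` is encoded.

```python
# -*- coding: utf-8 -*-
import io
import xml.etree.ElementTree as ET


def to_text(value, encoding="latin-1"):
    """Return value as a text (unicode) string.

    The 'latin-1' default is only an assumption; substitute the
    encoding the tag data was actually stored in.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def build_tag_xml(raw_tags, encoding="latin-1"):
    """Hypothetical helper: one <field name="tag"> per tag, as UTF-8 bytes."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # Decode once, up front, then iterate the list directly.
    for tag in to_text(raw_tags, encoding).split(", "):
        field = ET.SubElement(doc, "field", name="tag")
        field.text = tag  # text, not bytes; no manual .encode() here
    buf = io.BytesIO()
    # ElementTree encodes the whole tree when it writes.
    ET.ElementTree(add).write(buf, encoding="utf-8")
    return buf.getvalue()


# Sample bytes containing 0xe9 and 0xba, which are not valid UTF-8 on their own.
xml_bytes = build_tag_xml(b"caf\xe9, 30\xba, plain")
```

In Python 2, calling `.encode('utf-8')` on a byte string first decodes it implicitly with the default codec, which is why an encode call can raise `UnicodeDecodeError`. Decoding explicitly with the correct codec removes that hidden step.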
