python - Python ElementTree decode error

Question

I have an ElementTree instance that I am trying to serialize to text with the tostring function:

tostring(root, encoding='UTF-8')

I get a UnicodeDecodeError (traceback below) because one of the Element.text nodes contains the character u'\u2014'. I set the text attribute like this:

my_str = u'\u2014'
el.text = my_str.encode('UTF-8')

How can I serialize the tree to text successfully? Am I encoding the node incorrectly? Thanks.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "crisis_app/converters/to_xml.py", line 129, in convert
    return tostring(root, encoding='UTF-8')
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1127, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 821, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 940, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 940, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 940, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 938, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1074, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 288: ordinal not in range(128)

score 2 · Accepted Answer

If you do this:

my_str = u'\u2014'
el.text = my_str.encode('UTF-8')

you are setting the text to the UTF-8 encoded version of the unicode character. That is the same as:

el.text = '\xe2\x80\x94'

Now you no longer have a unicode character, but a sequence of bytes.

If you then do this:

tostring(root, encoding='UTF-8')

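you are saying that you want the content encoded as UTF-8. To do that, Python 2 first has to decode the byte string to unicode using the default encoding (ascii) and only then encode the result as UTF-8. The decode step fails because the bytes in the string are outside the ASCII range, which is why the traceback ends in a UnicodeDecodeError even though the failing call is text.encode(...).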
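The hidden decode can be written out explicitly. This is a minimal sketch of the equivalent steps, not ElementTree's actual code; the names raw, failed and message are illustrative only. It should behave the same on Python 2 and Python 3, because the decode is spelled out rather than left implicit.

```python
# The three UTF-8 bytes for U+2014, as stored in el.text in the question.
raw = u'\u2014'.encode('utf-8')

try:
    # Python 2 performs this ascii decode implicitly when .encode()
    # is called on a byte string; here it is written out by hand.
    raw.decode('ascii').encode('utf-8')
    failed = False
    message = ''
except UnicodeDecodeError as exc:
    failed = True
    message = str(exc)

print(failed)
print(message)
```

The first byte, 0xe2, is already outside the ASCII range, so the decode fails before any encoding to UTF-8 takes place.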

ElementTree handles unicode perfectly well, so just give it unicode instead of str:

>>> from xml.etree import ElementTree as et
>>> e = et.Element('test')
>>> e.text = u'\u2014'

>>> s = et.tostring(e)
>>> print s, repr(s)
<test>&#8212;</test> '<test>&#8212;</test>'

>>> s = et.tostring(e, encoding='utf-8')
>>> print s, repr(s)
<test>—</test> '<test>\xe2\x80\x94</test>'
