I have an ElementTree instance that I am trying to serialize to text with the following tostring call:

tostring(root, encoding='UTF-8')

I get a UnicodeDecodeError (traceback below) because one of the Element.text nodes contains the character u'\u2014'. I set the text attribute as follows:

my_str = u'\u2014'
el.text = my_str.encode('UTF-8')

How can I successfully serialize the tree to text? Am I encoding the node incorrectly? Thanks.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "crisis_app/converters/to_xml.py", line 129, in convert
    return tostring(root, encoding='UTF-8')
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1127, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 821, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 940, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 940, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 940, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 938, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1074, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 288: ordinal not in range(128)

1 Answer

If you do this:

my_str = u'\u2014'
el.text = my_str.encode('UTF-8')

you set the text to the UTF-8 encoded version of the unicode character. That is the same as writing

el.text = '\xe2\x80\x94'

Now you no longer have a unicode character, but a sequence of bytes.
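To make the difference concrete, here is a small sketch. It is written so that it should run on both Python 2.7 and Python 3; the variable name encoded is only for illustration and is not from the question.

```python
from __future__ import print_function

my_str = u'\u2014'                 # one unicode character
encoded = my_str.encode('utf-8')   # three bytes, no longer text

print(len(my_str), len(encoded))
print(repr(encoded))

# Decoding with the same codec gets the original character back.
round_tripped = encoded.decode('utf-8')
print(round_tripped == my_str)
```

The unicode string has length 1, while the encoded value has length 3, which is why the two cannot be used interchangeably as Element.text.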

If you then do this:

tostring(root, encoding='UTF-8')

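you are saying that you want the content encoded as UTF-8. To do that, Python 2 must internally first decode the byte string to unicode using the default encoding (ASCII) and then encode the result as UTF-8. The decode step fails, because the bytes in the string are outside the ASCII range. That is why a call to encode() ends in a UnicodeDecodeError.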
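The implicit step can be spelled out by hand. This is only a sketch of what Python 2 does behind the scenes when encode() is called on a byte string, not ElementTree's own code; the names raw and error are mine.

```python
from __future__ import print_function

raw = b'\xe2\x80\x94'   # the bytes that were stored in el.text

try:
    # Python 2 performs this decode implicitly before it can encode.
    raw.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(type(error).__name__)
print(error.reason)
```

The very first byte, 0xe2, is already above 127, so the ASCII codec rejects it. The position 288 in the traceback above is just where that byte happened to sit in the longer text being serialized.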

ElementTree works perfectly well with unicode, so just give it unicode instead of str:

>>> from xml.etree import ElementTree as et
>>> e = et.Element('test')
>>> e.text = u'\u2014'

>>> s = et.tostring(e)
>>> print s, repr(s)
<test>&#8212;</test> '<test>&#8212;</test>'

>>> s = et.tostring(e, encoding='utf-8')
>>> print s, repr(s)
<test>—</test> '<test>\xe2\x80\x94</test>'
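If some of your values already arrive as UTF-8 bytes, one option is to decode them before assigning them to Element.text. The following is a minimal sketch under that assumption. The helper to_text and the element names root and item are hypothetical, not part of ElementTree or of the question's code, and it is written so that it should run on Python 2.7 and Python 3.

```python
from xml.etree import ElementTree as et


def to_text(value, encoding='utf-8'):
    """Hypothetical helper: return value as unicode text, decoding bytes first."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


root = et.Element('root')
el = et.SubElement(root, 'item')

# Bytes are decoded back to unicode before ElementTree ever sees them.
el.text = to_text(u'\u2014'.encode('utf-8'))

# ElementTree does the encoding itself, once, at serialization time.
serialized = et.tostring(root, encoding='utf-8')
```

The design choice is to keep everything inside the tree as unicode and to encode exactly once, at the output boundary.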
answered 2013-07-10T20:52:35.023