python - ElementTree and unicode

Question

I have this character in an XML file:

<data>
  <products>
      <color>fumè</color>
  </products>
</data>

I tried to create an ElementTree instance with the following code:

string_data = open('file.xml')
x = ElementTree.fromstring(unicode(string_data.encode('utf-8')))

I get the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 185: ordinal not in range(128)

(Note: the position is not exact; I took the sample from a larger XML file.)

How can I solve this? Thanks.

score 35 · Accepted Answer

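You might have stumbled upon this problem while using Requests (HTTP for Humans). response.text decodes the response by default; use response.content to get the undecoded data instead, so that ElementTree can decode it itself. Just remember to use the correct encoding.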

More information: http://docs.python-requests.org/en/latest/user/quickstart/#response-content
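
A minimal sketch of the bytes-in approach, written for Python 3 (the thread itself is Python 2). The payload and the helper name parse_xml_body are made up for illustration, and no HTTP request is made; with Requests you would presumably pass response.content where raw_body is used here.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body: Latin-1 bytes with a matching XML declaration.
# With Requests this would be response.content (bytes), not response.text (str).
raw_body = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<data><products><color>fum\u00e8</color></products></data>"
).encode("iso-8859-1")


def parse_xml_body(raw):
    """Parse undecoded bytes; the parser reads the encoding from the declaration."""
    return ET.fromstring(raw)


root = parse_xml_body(raw_body)
color = root.find("products/color").text
print(color == "fum\u00e8")
```

Because the parser sees the raw bytes together with the declaration, no manual decode step is needed.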

score 15 · Accepted Answer

You need to decode the UTF-8 string into a unicode object. So

string_data.encode('utf-8')

should be

string_data.decode('utf-8')

assuming string_data actually is a UTF-8 string.

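To sum up: to get a UTF-8 string from a unicode object, you encode the unicode object (using the UTF-8 encoding); to turn a string into a unicode object, you decode the string using the corresponding encoding.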

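For more details on these concepts, I suggest reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (not Python specific).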
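
The encode/decode directions can be sketched as follows. This is Python 3, where the text type is str and the byte type is bytes (Python 2 called them unicode and str); the variable names are only illustrative.

```python
text = "fum\u00e8"                 # a text string; \u00e8 is the character è

raw = text.encode("utf-8")         # text -> bytes
print(raw)                         # b'fum\xc3\xa8'

round_trip = raw.decode("utf-8")   # bytes -> text
print(round_trip == text)

# Encoding with a codec that cannot represent è fails, which is the
# same family of error as the one quoted in the question.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)
```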

score 12 · Accepted Answer

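You do not need to decode the XML yourself for ElementTree to work. XML carries its own encoding information (defaulting to UTF-8), and ElementTree does the work for you, outputting unicode: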

>>> data = '''\
... <data>
...   <products>
...       <color>fumè</color>
...   </products>
... </data>
... '''
>>> x = ElementTree.fromstring(data)
>>> x[0][0].text
u'fum\xe8'

If your data is in a file(-like) object, just pass the filename or the file object directly to the ElementTree.parse() function:

x = ElementTree.parse('file.xml')
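
A self-contained sketch of the same idea in Python 3: it writes a small UTF-8 file to a temporary directory (the path and contents are placeholders, not the asker's real file) and lets ElementTree.parse() handle the decoding.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Sample document encoded as UTF-8, with no XML declaration.
xml_bytes = "<data><products><color>fum\u00e8</color></products></data>".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "file.xml")   # hypothetical location
    with open(path, "wb") as handle:
        handle.write(xml_bytes)

    # parse() opens the file itself and decodes the bytes.
    tree = ET.parse(path)
    root = tree.getroot()
    parsed_text = root[0][0].text

print(parsed_text == "fum\u00e8")
```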

score 2 · Accepted Answer

Have you tried using the parse function instead of opening the file yourself? (By the way, the open() call would need a .read() after it for .fromstring() to work.)

import xml.etree.ElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()
# etc...

score 1 · Accepted Answer

Most likely your file is not UTF-8. The è character could come from some other encoding, latin-1 for example.
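
To illustrate what that looks like, here is a small Python 3 sketch. It assumes the bytes are Latin-1 and that the document has no XML declaration; both are guesses about the asker's file, and the sample bytes are invented.

```python
import xml.etree.ElementTree as ET

# Latin-1 bytes with no XML declaration; 0xE8 is è in Latin-1.
latin1_bytes = b"<data><color>fum\xe8</color></data>"

# Without a declaration the parser assumes UTF-8, and a lone 0xE8 byte
# is not valid UTF-8, so parsing the raw bytes fails.
try:
    ET.fromstring(latin1_bytes)
    utf8_parse_failed = False
except ET.ParseError:
    utf8_parse_failed = True
print(utf8_parse_failed)

# Decoding explicitly with the real encoding avoids the problem.
root = ET.fromstring(latin1_bytes.decode("latin-1"))
fixed_text = root.find("color").text
print(fixed_text == "fum\u00e8")
```

Adding a correct encoding declaration to the file would be another way to get the same result.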

answered 2012-09-10T10:28:56.943

score 1 · Accepted Answer

The open() function does not return a string. Use open('file.xml').read() instead.
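
A rough Python 3 version of that fix. The helper name load_xml and the temporary file are assumptions for the sake of a runnable example; the file is read in binary mode so the parser receives undecoded bytes.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def load_xml(path):
    """Read the whole file as bytes, then hand the contents to fromstring()."""
    with open(path, "rb") as handle:
        data = handle.read()   # the file contents, not the file object
    return ET.fromstring(data)


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "file.xml")   # hypothetical file
    with open(sample_path, "wb") as handle:
        handle.write("<data><color>fum\u00e8</color></data>".encode("utf-8"))
    loaded_text = load_xml(sample_path).find("color").text

print(loaded_text == "fum\u00e8")
```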

answered 2014-03-10T10:07:35.070
