python - Python ElementTree won't convert non-breaking spaces when outputting UTF-8

Question

I'm trying to parse, manipulate, and output HTML using Python's ElementTree:

import sys
from cStringIO  import StringIO
from xml.etree  import ElementTree as ET
from htmlentitydefs import entitydefs

source = StringIO("""<html>
<body>
<p>Less than &lt;</p>
<p>Non-breaking space &nbsp;</p>
</body>
</html>""")

parser = ET.XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity.update(entitydefs)
etree = ET.ElementTree()

tree = etree.parse(source, parser=parser)
for p in tree.findall('.//p'):
    print ET.tostring(p, encoding='UTF-8')

When I run this with Python 2.7 on Mac OS X 10.6, I get:

<p>Less than &lt;</p>

Traceback (most recent call last):
  File "bar.py", line 20, in <module>
    print ET.tostring(p, encoding='utf-8')
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1120, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 815, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 931, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1067, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 19: ordinal not in range(128)

I thought specifying "encoding='UTF-8'" would take care of the non-breaking space character, but apparently it doesn't. What should I do instead?

score 7 · Accepted Answer

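0xA0 is the Latin-1 encoding of the non-breaking space, not a unicode character, and the value of p.text in the loop is a str rather than a unicode object. To encode it as UTF-8, Python must first implicitly convert it to a unicode string (that is, decode it). When it does so it assumes ASCII, because it has not been told anything else. 0xA0 is not a valid ASCII byte, but it is a valid Latin-1 character, hence the UnicodeDecodeError.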
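
The decoding step described above can be reproduced in isolation. The sketch below is written for Python 3, which is an assumption (the question uses Python 2.7); there, byte strings are never decoded implicitly, so the ASCII failure is triggered by hand. The variable names are illustrative only.

```python
# Python 3 sketch: the byte 0xA0 is valid Latin-1 but not valid ASCII.
raw = b"\xa0"

try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    # Same class of error as in the traceback from the question.
    ascii_ok = False

text = raw.decode("latin-1")       # U+00A0 NO-BREAK SPACE
utf8_bytes = text.encode("utf-8")  # two bytes in UTF-8

print(ascii_ok, repr(text), utf8_bytes)
```

Decoding with the right codec first and encoding afterwards is the explicit form of what Python 2 attempts implicitly with the ascii codec.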

The reason you have Latin-1 characters instead of unicode characters is that entitydefs maps names to Latin-1 encoded strings. You need the unicode code points, which you can get from htmlentitydefs.name2codepoint.

The following version should fix it for you:

import sys
from cStringIO  import StringIO
from xml.etree  import ElementTree as ET
from htmlentitydefs import name2codepoint

source = StringIO("""<html>
<body>
<p>Less than &lt;</p>
<p>Non-breaking space &nbsp;</p>
</body>
</html>""")

parser = ET.XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity.update((x, unichr(i)) for x, i in name2codepoint.iteritems())
etree = ET.ElementTree()

tree = etree.parse(source, parser=parser)
for p in tree.findall('.//p'):
    print ET.tostring(p, encoding='UTF-8')
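
For readers on Python 3, the same name-to-character mapping can be sketched as follows. This is an assumption-laden illustration, not part of the original answer: html.entities is the Python 3 name for htmlentitydefs, chr replaces unichr, and items replaces iteritems. The name entity_map is hypothetical, and how the mapping is wired into the parser on Python 3 is not covered here.

```python
# Python 3 sketch: build a mapping from entity names to unicode characters.
from html.entities import name2codepoint

# Each entity name maps to the single character at its code point.
entity_map = {name: chr(codepoint) for name, codepoint in name2codepoint.items()}

nbsp = entity_map["nbsp"]
print(repr(nbsp), nbsp.encode("utf-8"))
```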

score 4 · Accepted Answer

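XML only defines &lt;, &gt;, &apos;, &quot; and &amp;. &nbsp; and the others come from HTML. So you have a couple of choices.

You can change the source to use numeric entities, such as &#160; or &#xA0;, both of which are equivalent to &nbsp;.
You can use a DTD that defines those values.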
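
Numeric character references are part of XML itself, so the parser resolves them without any DTD or entity table. A minimal Python 3 sketch (the Python version and the variable names are assumptions, not from the original answer):

```python
# Python 3 sketch: decimal and hexadecimal references to U+00A0 parse
# without any extra parser configuration.
from xml.etree import ElementTree as ET

decimal = ET.fromstring("<p>Non-breaking space &#160;</p>")
hexadecimal = ET.fromstring("<p>Non-breaking space &#xA0;</p>")

# Serializing as UTF-8 writes the character as its two-byte UTF-8 form.
serialized = ET.tostring(decimal, encoding="utf-8")

print(repr(decimal.text), repr(hexadecimal.text), serialized)
```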

There is some useful information in the XSLT FAQ (it is about XSLT, but XSLT is written in XML, so the same applies).

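I see the question now includes a stack trace; that changes things. Are you sure the string is in UTF-8? If it resolves to the single byte 0xA0, then it is not UTF-8 and is more likely cp1252 or iso-8859-1.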

score 3 · Accepted Answer

Your &nbsp; is being converted to '\xa0', which is the Latin-1 byte for a non-breaking space and lies outside ASCII (the UTF-8 encoding is '\xc2\xa0').

'\xa0'.encode('utf-8')

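raises a UnicodeDecodeError, because the byte string is first decoded with the default codec, ascii, which only covers 128 characters, and ord('\xa0') is 160. Setting the default encoding to something else, e.g.: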

import sys
reload(sys)  # must reload sys to use 'setdefaultencoding'
sys.setdefaultencoding('latin-1')

print '\xa0'.encode('utf-8', "xmlcharrefreplace")

should solve your problem.

score -1 · Accepted Answer

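HTML is not the same as XML, so markup like &nbsp; will not work. Ideally, if you are trying to pass that information through XML, you could first XML-encode the data above, so that it looks something like this: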

<xml>
<mydata>
&lt;html&gt;
&lt;body&gt;
&lt;p&gt;Less than &amp;lt;&lt;/p&gt;
&lt;p&gt;Non-breaking space &amp;nbsp;&lt;/p&gt;
&lt;/body&gt;
&lt;/html&gt;
</mydata>
</xml>

Then, after parsing the XML, you can HTML-unencode the string.
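
The round trip described above might look like the following Python 3 sketch. The choice of xml.sax.saxutils.escape for the XML-encoding step and html.unescape for the final HTML decoding is an assumption, since the answer names no specific functions; the mydata wrapper follows the example above and the variable names are illustrative.

```python
# Python 3 sketch: carry HTML as escaped text inside XML, then decode it.
import html
from xml.etree import ElementTree as ET
from xml.sax.saxutils import escape

fragment = "<p>Non-breaking space &nbsp;</p>"

# escape() turns &, < and > into XML entities, so the HTML becomes plain text.
wrapped = "<mydata>" + escape(fragment) + "</mydata>"

# Parsing restores the original HTML string as the element's text.
recovered = ET.fromstring(wrapped).text

# html.unescape() resolves HTML entities such as &nbsp; to characters.
decoded = html.unescape(recovered)

print(wrapped)
print(repr(decoded))
```

Note that this treats the HTML as an opaque string: the p elements are no longer part of the XML tree, so they cannot be found with findall.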

score -1 · Accepted Answer

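I don't think the problem you are running into here is with your nbsp entity, but with your print statement.

Your error is:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 19: ordinal not in range(128)

I think this is because you are taking a utf-8 string (from ET.tostring(p, encoding='utf-8')) and trying to echo it in an ascii terminal. So Python implicitly converts that string to unicode and then converts it to ascii. While nbsp can be represented directly in utf-8, it cannot be represented directly in ascii. Hence the error.

Try saving the output to a file instead and see whether you get what you expect.

Alternatively, try print ET.tostring(p, encoding='ascii'), which should cause ElementTree to use numeric character entities to represent anything that cannot be represented in ascii.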
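
The ASCII serialization behaviour can be checked with a short Python 3 sketch (the Python version is an assumption; the thread uses Python 2.7). The element is built by hand here so that the example does not depend on entity parsing.

```python
# Python 3 sketch: serializing to ASCII replaces non-ASCII characters
# with numeric character references.
from xml.etree import ElementTree as ET

p = ET.Element("p")
p.text = "Non-breaking space \u00a0"

ascii_bytes = ET.tostring(p, encoding="us-ascii")
print(ascii_bytes)
```

The result contains only ASCII bytes, so it can be written to any terminal regardless of its encoding.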
