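I'll start with the question: "Can I use an alternative parser that is perhaps less strict and allows UTF-8 characters?"
Every conforming XML parser accepts data encoded in UTF-8. In fact, UTF-8 is the default encoding.
An XML document may start with a declaration like this: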
`<?xml version="1.0" encoding="UTF-8"?>`
or like this:
`<?xml version="1.0"?>`
or have no declaration at all ... in each case, the parser will decode the document as UTF-8.
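As a quick check of that claim, here is a minimal Python 3 sketch. It assumes the standard-library `xml.etree.ElementTree` parser; the names `body`, `variants` and `texts` are illustrative only. The same UTF-8 bytes are parsed with each of the three prologs.

```python
import xml.etree.ElementTree as ET

# Element content containing a non-ASCII character, encoded as UTF-8 bytes.
body = "<root>can\u2019t</root>".encode("utf-8")

variants = [
    b'<?xml version="1.0" encoding="UTF-8"?>' + body,  # explicit encoding
    b'<?xml version="1.0"?>' + body,                   # declaration, no encoding
    body,                                              # no declaration at all
]

# Each variant should decode to the same text.
texts = [ET.fromstring(doc).text for doc in variants]
print(texts)
```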
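However, your data is NOT encoded in UTF-8 ... it is probably Windows-1252, aka cp1252.
If the encoding is not UTF-8, then either the creator should include a declaration (or the recipient can add one), or the recipient can transcode the data to UTF-8. The following shows what works and what doesn't: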
>>> import xml.etree.ElementTree as ET
>>> from StringIO import StringIO as sio
>>> raw_text = '<root>can\x92t</root>' # text encoded in cp1252, no XML declaration
>>> t = ET.parse(sio(raw_text))
[tracebacks omitted]
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 9
# parser is expecting UTF-8
>>> t = ET.parse(sio('<?xml version="1.0" encoding="UTF-8"?>' + raw_text))
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 47
# parser is expecting UTF-8 again
>>> t = ET.parse(sio('<?xml version="1.0" encoding="cp1252"?>' + raw_text))
>>> t.getroot().text
u'can\u2019t'
# parser was told to expect cp1252; it works
>>> import unicodedata
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
# not quite an apostrophe, but better than an exception
>>> fixed_text = raw_text.decode('cp1252').encode('utf8')
# alternative: we transcode the data to UTF-8
>>> t = ET.parse(sio(fixed_text))
>>> t.getroot().text
u'can\u2019t'
# UTF-8 is the default; no declaration needed
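The session above is Python 2 (the `StringIO` module, `u''` literals, `str.decode`). As a rough Python 3 equivalent, the sketch below feeds the parser `bytes` directly. It assumes the standard-library `xml.etree.ElementTree`; `parse_text` is a hypothetical helper introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

# Text encoded in cp1252, with no XML declaration.
raw = b"<root>can\x92t</root>"


def parse_text(data):
    """Parse an XML document given as bytes and return the root's text."""
    return ET.fromstring(data).text


# 1. Parsing as-is fails: the parser expects UTF-8 and 0x92 is not valid there.
try:
    parse_text(raw)
    parsed_as_is = True
except ET.ParseError:
    parsed_as_is = False

# 2. Prepending a declaration tells the parser to expect cp1252.
declared = parse_text(b'<?xml version="1.0" encoding="cp1252"?>' + raw)

# 3. Alternatively, transcode the data to UTF-8; no declaration is needed.
transcoded = parse_text(raw.decode("cp1252").encode("utf-8"))

print(parsed_as_is, ascii(declared), ascii(transcoded))
```

If you would rather keep the original bytes untouched, prepending the declaration is the less invasive of the two fixes. Transcoding is the better choice when the data will be stored or passed on as UTF-8 anyway.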