python - UTF-8 encode raises UnicodeDecodeError despite "errors='replace'"

Question

I am trying to write out some text, encoding it as UTF-8 where possible, using the following code:

outf.write((lang_name + "," + (script_name or "") + "\n").encode("utf-8", errors='replace'))

I get the following error:

File "C:\Python27\lib\encodings\cp1252.py", line 15, in decode 
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 6: character maps to <undefined>

I thought the errors='replace' part of my encode call would handle that?

FWIW, I am just opening the file with

outf = open(outfile, 'w')

without explicitly declaring an encoding.

print repr(outf)

yields:

<open file 'myfile.csv', mode 'w' at 0x000000000315E930>

I split the write statement into separate concatenation, encoding, and file-write steps:

outstr = lang_name + "," + (script_name or "") + "\n"
encoded_outstr = outstr.encode("utf-8", errors='replace')
outf.write(encoded_outstr)

It is the concatenation that raises the exception.

The strings are, via print repr(foo):

lang_name: 'G\xc4\x81ndh\xc4\x81r\xc4\xab'
script_name: u'Kharo\u1e63\u1e6dh\u012b'

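Further detective work shows that I can concatenate either of them with a plain ASCII string without difficulty; it is putting them both into the same string that breaks things.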

score 2 · Accepted Answer

So, the problem is that you are concatenating the bytestring 'G\xc4\x81ndh\xc4\x81r\xc4\xab' with the Unicode string u'Kharo\u1e63\u1e6dh\u012b'.

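To do that, Python 2.7 tries to convert the bytestring to Unicode by decoding it with its default encoding. Your default encoding is cp1252 rather than ASCII, for reasons I cannot determine from here, but it fails either way, just as it would with ASCII, because the string is UTF-8.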
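The implicit step can be reproduced by hand. The following is a minimal sketch, not the asker's code: it decodes the same bytes explicitly with each candidate codec, and the helper name try_decode is made up for illustration.

```python
# The byte string from the question: UTF-8 encoded text.
lang_name = b'G\xc4\x81ndh\xc4\x81r\xc4\xab'

def try_decode(raw, encoding):
    """Return the decoded text, or the exception's class name on failure."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return type(exc).__name__

# ascii rejects every byte >= 0x80; cp1252 has no mapping for byte 0x81;
# only utf-8 matches how the bytes were actually produced.
results = {enc: try_decode(lang_name, enc) for enc in ("ascii", "cp1252", "utf-8")}
```

Only the utf-8 entry holds decoded text; the other two hold the name of the exception that the implicit conversion would have raised.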

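Your best solution is probably to make sure this cannot happen in the first place, by changing how the variables get those values.

If you cannot, then since you encode to UTF-8 on the next line anyway, it is probably simplest to encode only script_name: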

encoded_outstr = lang_name + b"," + (script_name or u"").encode('utf-8') + b"\n"

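Note that I used b"," to make those string literals explicitly byte strings rather than Unicode strings; if you are using from __future__ import unicode_literals for Python 3 compatibility, they would be Unicode by default and the problem would happen again.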

score 2 · Accepted Answer

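When you concatenate a byte string and a Unicode string, Python 2 first tries to convert the byte string to Unicode. If the byte string contains any non-ASCII characters in the range \x80 to \xff, the automatic conversion fails with the error you showed. Note that it says "can't decode", not "can't encode", which indicates that the error is not happening in your call to encode.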
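To make the encode/decode distinction concrete, here is a small sketch (the variable names are illustrative only). The errors argument applies to the single call it is passed to, so errors='replace' on encode has no influence on a decode that happens earlier.

```python
# errors= on encode() governs characters the target codec cannot represent.
replaced_on_encode = u"G\u0101ndh\u0101r\u012b".encode("ascii", errors="replace")

# decode() takes its own errors= argument. The implicit decode during
# concatenation is a separate, strict step, so it raises instead of
# substituting a replacement character as this explicit call does.
raw = b"G\xc4\x81ndh\xc4\x81r\xc4\xab"
replaced_on_decode = raw.decode("cp1252", errors="replace")
```

Replacement on the decode side hides the failure but still yields the wrong text here, because the bytes are UTF-8 rather than cp1252.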

The solution is to decode the byte string to Unicode yourself, using the appropriate encoding, so that all inputs to the concatenation are Unicode strings.

outstr = lang_name.decode("utf-8") + u"," + (script_name or u"") + u"\n"
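If the values can arrive as either type, the same idea can be wrapped in a small helper. This is a sketch under assumptions: to_text and build_line are hypothetical names, and any byte input is assumed to be UTF-8, as it is in the question.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a Unicode string; None becomes an empty string."""
    if value is None:
        return u""
    if isinstance(value, bytes):
        # Assumption: byte strings hold text in the given encoding.
        return value.decode(encoding)
    return value

def build_line(lang_name, script_name):
    # Every operand is Unicode, so no implicit decoding takes place.
    return to_text(lang_name) + u"," + to_text(script_name) + u"\n"

line = build_line(b'G\xc4\x81ndh\xc4\x81r\xc4\xab', u'Kharo\u1e63\u1e6dh\u012b')
# Encode once, at the point where the text is written out.
encoded_line = line.encode("utf-8")
```

Decoding at the input boundary and encoding only when writing keeps mixed-type concatenation out of the rest of the code.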
