python - 使python默认用字符串替换不可编码的字符

Question

我想让python忽略它无法编码的字符，只需用字符串替换它们"<could not encode>"。

例如，假设默认编码是 ascii，命令

'%s is the word'%'ébác'

会产生

'<could not encode>b<could not encode>c is the word'

在我的所有项目中，有什么方法可以使它成为默认行为？

score 11 · Accepted Answer

该str.encode函数接受一个定义错误处理的可选参数：

str.encode([encoding[, errors]])

从文档：

返回字符串的编码版本。默认编码是当前的默认字符串编码。可能会给出错误以设置不同的错误处理方案。错误的默认值为“严格”，这意味着编码错误会引发 UnicodeError。其他可能的值是 'ignore'、'replace'、'xmlcharrefreplace'、'backslashreplace' 和通过 codecs.register_error() 注册的任何其他名称，请参阅 Codec Base Classes 部分。有关可能的编码列表，请参阅标准编码部分。

在您的情况下，该codecs.register_error功能可能很有趣。

[关于坏字符的注意事项]

顺便说一句，在使用时请注意register_error，除非您注意，否则您可能会发现自己不仅会替换单个坏字符，还会用字符串替换成组的连续坏字符。每次运行错误字符时，您都会调用一次错误处理程序，而不是每个字符。

score 5 · Accepted Answer

>>> help("".encode)
Help on built-in function encode:

encode(...)
S.encode([encoding[,errors]]) -> object

Encodes S using the codec registered for encoding. encoding defaults
to the default encoding. errors may be given to set a different error
handling scheme. Default is 'strict' meaning that encoding errors raise
a UnicodeEncodeError. **Other possible values are** 'ignore', **'replace'** and
'xmlcharrefreplace' as well as any other name registered with
codecs.register_error that is able to handle UnicodeEncodeErrors.

因此，例如：

>>> x
'\xc3\xa9b\xc3\xa1c is the word'
>>> x.decode("ascii")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> x.decode("ascii", "replace")
u'\ufffd\ufffdb\ufffd\ufffdc is the word'

将您自己的回调添加到 codecs.register_error 以替换为您选择的字符串。

python - 使python默认用字符串替换不可编码的字符

2 回答 2

Related

Reference