python - Python can't encode with surrogateescape

Question

I have a problem with Unicode surrogate encoding in Python (3.4):

>>> b'\xCC'.decode('utf-16_be', 'surrogateescape').encode('utf-16_be', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-16-be' codec can't encode character '\udccc' in position 0: surrogates not allowed

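If I'm not mistaken, according to the Python documentation:

'surrogateescape': On decoding, replace byte with individual surrogate code ranging from U+DC80 to U+DCFF. This code will then be turned back into the same byte when the 'surrogateescape' error handler is used when encoding the data.

The code should just produce the source sequence (b'\xCC'). So why is the exception raised?

This is possibly related to my second question:

Changed in version 3.4: The utf-16* and utf-32* encoders no longer allow surrogate code points (U+D800–U+DFFF) to be encoded.

(From https://docs.python.org/3/library/codecs.html#standard-encodings)

As far as I know, some code points cannot be encoded in UTF-16 without surrogate pairs. So what is the reason behind this change?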

score 6 · Accepted Answer

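This change was made because the Unicode standard explicitly disallows such encodings. See issue #12892. As for the exception, the surrogateescape error handler evidently cannot be made to work with UTF-16 or UTF-32 when encoding, because these codecs are not ASCII-compatible.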
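To illustrate the distinction between surrogate pairs and lone surrogate code points, here is a minimal sketch, assuming Python 3.4 or later. The sample characters and the variable names are illustrative and not taken from the issue. A code point outside the Basic Multilingual Plane is still written as a pair of surrogate code units by the encoder, while a lone surrogate code point in a str is rejected.

```python
# Assumes Python 3.4+; sample characters and names are illustrative only.

# A code point above U+FFFF is encoded as a surrogate *pair* of
# 16-bit code units. The encoder produces the pair itself.
emoji = '\U0001F600'
encoded = emoji.encode('utf-16-be')
print(encoded)  # b'\xd8=\xde\x00', i.e. code units 0xD83D 0xDE00

# A lone surrogate *code point* in a str is not a character,
# so the strict encoder refuses it.
try:
    '\ud83d'.encode('utf-16-be')
    lone_error = None
except UnicodeEncodeError as exc:
    lone_error = exc.reason
print(lone_error)
```

So the 3.4 change does not affect ordinary text outside the BMP; it only rejects str values that contain surrogate code points directly.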

Specifically, from that issue's discussion:

I tested utf_16_32_surrogates_4.patch: surrogateescape with as encoder does not work as expected.
>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'ignore')
'[]'
>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'replace')
'[�]'
>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'surrogateescape')
'[\udc80\udcdc\uffff'
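=> I expected '[\udc80\udcdc]'.

The response was:

Yes, surrogateescape doesn't work with ASCII-incompatible encodings, and it can't. First, it can't represent the result of decoding b'\x00\xd8' from utf-16-le or b'ABCD' from utf-32*. This problem is worth a separate issue (or even a PEP) and a discussion on Python-Dev.

I believe the surrogateescape handler is meant more for UTF-8 data. That decoding from UTF-16 or UTF-32 now works with it too is a nice extra, but it evidently cannot work in the other direction.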
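The contrast can be sketched as follows, assuming Python 3.4 or later. The sample bytes and variable names are made up for illustration. With UTF-8, surrogateescape round-trips undecodable bytes; with UTF-16, encoding the escaped surrogate fails as in the question.

```python
# Assumes Python 3.4+; sample data and names are illustrative only.

# UTF-8: invalid bytes survive a decode/encode round trip.
raw = b'caf\xe9 \xcc'  # not valid UTF-8
text = raw.decode('utf-8', 'surrogateescape')
roundtrip = text.encode('utf-8', 'surrogateescape')
print(roundtrip == raw)  # True

# UTF-16: the escaped surrogate cannot be encoded back.
try:
    '\udccc'.encode('utf-16-be', 'surrogateescape')
    utf16_failed = False
except UnicodeEncodeError:
    utf16_failed = True
print(utf16_failed)  # True
```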

score 2 · Accepted Answer

If you use surrogatepass (instead of surrogateescape), it should work on Python 3.

See https://docs.python.org/3/library/codecs.html#codec-base-classes, which says that surrogatepass allows encoding and decoding of surrogate codes (for UTF-related encodings).
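A minimal sketch of this, assuming Python 3.4 or later (the variable names are illustrative). Note that surrogatepass writes the surrogate code point itself as a 16-bit code unit, so the result is two bytes rather than the single original byte b'\xCC' from the question.

```python
# Assumes Python 3.4+; names are illustrative only.
lone = '\udccc'

# surrogatepass lets the UTF-16 codec write the surrogate as a code unit.
encoded = lone.encode('utf-16-be', 'surrogatepass')
print(encoded)  # b'\xdc\xcc'

# Decoding with the same handler restores the surrogate code point.
decoded = encoded.decode('utf-16-be', 'surrogatepass')
print(decoded == lone)  # True
```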
