python - PyQt and UTF-16 surrogates

Question

Consider the following Python code:

In [1]: from PyQt4.QtCore import QTextCodec

In [2]: import codecs

In [3]: surrogate_raw = b'\x34\xd8\x1e\xdd'  # this is UTF-16 encoded character (surrogate pair)

In [4]: QTextCodec.codecForName('utf-16le').toUnicode(surrogate_raw)
Out[4]: '\ud834\udd1e'

In [5]: codecs.getdecoder('utf-16le')(surrogate_raw)
Out[5]: ('𝄞', 4)

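As you can see, the string returned from QTextCodec::toUnicode is incorrect. It contains two characters whose code points equal the surrogate values, instead of a single character. This is not valid Unicode (those code points are reserved), and the string cannot be converted to another encoding. For example, it is impossible to print the string in a Linux console that uses UTF-8: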

In [6]: incorr = QTextCodec.codecForName('utf-16le').toUnicode(surrogate_raw)

In [7]: print(incorr)
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-7-556b71617dae> in <module>()
----> 1 print(incorr)

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud834' in position 0: surrogates not allowed
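
The arithmetic behind the surrogate pair can be checked in plain Python, without PyQt. The following is only a minimal sketch: the helper name `combine_surrogates` is made up for illustration and is not part of PyQt or the standard library.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# The two little-endian code units in b'\x34\xd8\x1e\xdd' are 0xD834 and 0xDD1E.
code_point = combine_surrogates(0xD834, 0xDD1E)

# Python's own decoder should yield a single character with that code point.
expected = b'\x34\xd8\x1e\xdd'.decode('utf-16-le')

print(hex(code_point))              # 0x1d11e
print(len(expected))                # 1 on Python 3.3 and later
print(ord(expected) == code_point)  # True
```

A correct conversion therefore gives one character, U+1D11E, rather than the two separate surrogates shown in Out[4].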

All QStrings behave the same way, even those returned from custom code wrapped with SIP. I have tested this code on 64-bit Ubuntu 13.04 with Python 3.3.1 and on 32-bit Windows XP with Python 3.3.0.

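I can guess where this error comes from: in Qt, strings are always represented in UTF-16, and you are expected to check manually whether a QChar is a leading or trailing surrogate. Python strings are different, however, and the wrapper performs the conversion incorrectly.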
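
If the wrapper really does hand back the two surrogates as separate characters, one possible Python-side workaround is to round-trip the string through UTF-16 with the `surrogatepass` error handler, so that the pair is recombined on decode. This is a sketch under that assumption, not a fix in PyQt itself: the string literal stands in for the QTextCodec result, `fix_surrogates` is a hypothetical helper name, and `surrogatepass` works with the UTF-16 codecs only from Python 3.4 on (later than the 3.3 releases tested here).

```python
def fix_surrogates(text):
    """Recombine surrogate pairs that arrived as separate code points."""
    # 'surrogatepass' lets surrogates be encoded as raw UTF-16 code units;
    # a normal decode of those units then merges each valid pair.
    return text.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')

broken = '\ud834\udd1e'   # stand-in for the string returned by toUnicode
fixed = fix_surrogates(broken)

print(len(broken), len(fixed))  # 2 1
print(fixed.encode('utf-8'))    # b'\xf0\x9d\x84\x9e'
```

An unpaired surrogate would still make the decode step raise UnicodeDecodeError, so this only helps when the surrogates come in valid pairs.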

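On the other hand, PyQt has been widely used around the world for a long time, and any bug in a class as basic as this would have been found and fixed. So I assume the cause is something I am doing wrong. Where is my mistake?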
