python - Detecting/removing unpaired surrogate characters in Python 2 + GTK

Question

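In Python 2.7, I can successfully convert the Unicode string u"abc\udc34xyz" to UTF-8 (the result is "abc\xed\xb0\xb4xyz"). But when I pass that UTF-8 string to, e.g., pango_parse_markup() or g_convert_with_fallback(), I get errors like "Invalid byte sequence in conversion input". Apparently the GTK/Pango functions detect the "unpaired surrogate" in the string and (correctly?) reject it.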

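Python 3 doesn't even allow the Unicode string to be converted to UTF-8 (error: "'utf-8' codec can't encode character '\udc34' in position 3: surrogates not allowed"), but I can run "abc\udc34xyz".encode("utf8", "replace") to get a valid UTF-8 string in which the lone surrogate is replaced by some other character. That's fine for me, but I need a solution for Python 2.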
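For reference, the Python 3 behavior described above can be reproduced with a short sketch like the following. It assumes Python 3 only, and the variable names are illustrative. Note that when encoding, the "replace" handler substitutes "?" rather than U+FFFD.

```python
# Python 3 only: a str may hold a lone surrogate, but the UTF-8 codec
# refuses to encode it unless an error handler is given.
text = "abc\udc34xyz"

try:
    text.encode("utf-8")              # strict mode raises
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = exc                # "... surrogates not allowed"

# With "replace", the offending character becomes "?" on encoding.
replaced = text.encode("utf-8", "replace")
print(repr(replaced))
```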

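So the question is: in Python 2.7, how can I convert that Unicode string to UTF-8 while replacing lone surrogates with a replacement character such as U+FFFD? Preferably using only standard Python functions and GTK/GLib/G... functions.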

By the way, iconv can convert the string to UTF-8, but it simply drops the bad character instead of replacing it with U+FFFD.

score 10 · Accepted Answer

You can do the replacement yourself before encoding:

import re

lone = re.compile(
    ur'''(?x)            # verbose expression (allows comments)
    (                    # begin group
    [\ud800-\udbff]      #   match leading surrogate
    (?![\udc00-\udfff])  #   but only if not followed by trailing surrogate
    )                    # end group
    |                    #  OR
    (                    # begin group
    (?<![\ud800-\udbff]) #   if not preceded by leading surrogate
    [\udc00-\udfff]      #   match trailing surrogate
    )                    # end group
    ''')

u = u'abc\ud834\ud82a\udfcdxyz'
print repr(u)
b = lone.sub(ur'\ufffd',u).encode('utf8')
print repr(b)
print repr(b.decode('utf8'))

Output:

u'abc\ud834\U0001abcdxyz'
'abc\xef\xbf\xbd\xf0\x9a\xaf\x8dxyz'
u'abc\ufffd\U0001abcdxyz'
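The same idea can be wrapped in a pair of small helpers, one for detection and one for replacement. This is only a sketch under assumptions: the names LONE_SURROGATE, has_lone_surrogate and encode_utf8_safely are hypothetical, and the pattern is the one from the answer written without verbose mode. It is written so that it should also run unchanged on Python 3.

```python
import re

# Same pattern as above, split across two literals for readability.
LONE_SURROGATE = re.compile(
    u'[\ud800-\udbff](?![\udc00-\udfff])'    # lead surrogate, no trail after it
    u'|(?<![\ud800-\udbff])[\udc00-\udfff]'  # trail surrogate, no lead before it
)


def has_lone_surrogate(text):
    """Return True if text contains at least one unpaired surrogate."""
    return LONE_SURROGATE.search(text) is not None


def encode_utf8_safely(text, replacement=u'\ufffd'):
    """Replace each unpaired surrogate with replacement, then encode to UTF-8."""
    return LONE_SURROGATE.sub(replacement, text).encode('utf-8')


print(repr(encode_utf8_safely(u'abc\udc34xyz')))
```

The returned byte string no longer contains an encoded lone surrogate, so it can be handed to code that expects strict UTF-8. One caveat: on Python 3, where a str holds code points rather than UTF-16 code units, an adjacent lead/trail pair is left untouched by this pattern and would still fail to encode.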
