python - Python: getting the correct string length when it contains surrogate pairs

Question

Consider the following IPython exchange:

In [1]: s = u'華袞與緼同歸'

In [2]: len(s)
Out[2]: 8

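The correct output should be 7, but because the fifth of these seven Chinese characters has a high Unicode code point, it is represented by a "surrogate pair" in UTF-16 (the internal encoding of a narrow Python build) rather than by a single code unit, so Python counts it as two characters instead of one.

Even when I use unicodedata, which correctly returns the surrogate pair as a single code point (\U00026177), the result still yields the wrong length when passed to len():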
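To make the extra unit concrete, here is a minimal sketch of the standard UTF-16 arithmetic that splits a code point above U+FFFF into two surrogates. The helper name to_surrogate_pair is made up for illustration and is not part of any library.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x26177)
print(hex(high), hex(low))  # 0xd858 0xdd77
```

The fifth character, U+26177, therefore occupies two 16-bit units, which is where the extra count in len(s) comes from.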

In [3]: import unicodedata

In [4]: unicodedata.normalize('NFC', s)
Out[4]: u'\u83ef\u889e\u8207\u7dfc\U00026177\u540c\u6b78'


In [5]: len(unicodedata.normalize('NFC', s))
Out[5]: 8

Short of taking a drastic step like recompiling Python for UTF-32, is there a simple way to get the correct length in this situation?

I'm on IPython 0.13, Python 2.7.2, Mac OS 10.8.2.

score 8 · Accepted Answer

I think this has been fixed in 3.3. See:

http://docs.python.org/py3k/whatsnew/3.3.html
http://www.python.org/dev/peps/pep-0393/ (search for wstr_length)
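As a quick check of that behaviour, the sketch below assumes Python 3.3 or later, where len() counts code points. It also counts UTF-16 code units to show where the 8 in the question comes from.

```python
# The string from the question, written with escapes; assumes Python 3.3+.
s = "\u83ef\u889e\u8207\u7dfc\U00026177\u540c\u6b78"

code_points = len(s)                           # str is indexed by code point
utf16_units = len(s.encode("utf-16-le")) // 2  # each UTF-16 unit is 2 bytes

print(code_points, utf16_units)  # 7 8
```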

score 7 · Accepted Answer

I wrote a function to do this on Python 2:

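import re
# Matches a high surrogate immediately followed by a low surrogate.
SURROGATE_PAIR = re.compile(u'[\ud800-\udbff][\udc00-\udfff]', re.UNICODE)
def unicodeLen(s):
  # Collapse each pair to one character so len() counts code points.
  return len(SURROGATE_PAIR.sub('.', s))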

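By replacing each surrogate pair with a single character, we "fix" the len function. On ordinary strings this should be quite efficient: since the pattern does not match, the original string is returned unmodified. It should also work on wide (32-bit) Python builds, since the surrogate-pair encoding is not used there.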
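The substitution can be tried on an interpreter that is not a narrow build by writing the pair out as two explicit escapes. This is only a sketch, assuming Python 3; the name unicode_len and the sample strings are illustrative.

```python
import re

# Same pattern as above: a high surrogate followed by a low surrogate.
SURROGATE_PAIR = re.compile("[\ud800-\udbff][\udc00-\udfff]")

def unicode_len(s):
    # Each matched pair collapses to one placeholder character.
    return len(SURROGATE_PAIR.sub(".", s))

# U+26177 spelled as its two surrogates, mimicking a narrow build's storage.
narrow_style = "\u83ef\u889e\u8207\u7dfc\ud858\udd77\u540c\u6b78"
plain = "abc"

print(len(narrow_style), unicode_len(narrow_style))  # 8 7
print(unicode_len(plain))                            # 3
```

Strings without surrogate pairs pass through unchanged, so the result matches the plain len() in that case.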

score 3 · Accepted Answer

You can override the len function in Python (see: How does len work?) and add an if statement inside it that checks for unicode strings containing surrogate pairs.
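One possible reading of this suggestion is a module-level wrapper that shadows len and falls back to the built-in for everything except text. The sketch below assumes Python 3 (the builtins module) and reuses the surrogate-pair pattern from the previous answer. Shadowing a built-in is a design choice that can surprise other readers of the code, so a differently named helper may be preferable.

```python
import builtins
import re

_SURROGATE_PAIR = re.compile("[\ud800-\udbff][\udc00-\udfff]")

def len(obj):
    """Shadow the built-in len within this module only."""
    if isinstance(obj, str):
        # Count each surrogate pair as a single character.
        return builtins.len(_SURROGATE_PAIR.sub(".", obj))
    # Anything that is not text keeps the normal behaviour.
    return builtins.len(obj)

print(len("\ud858\udd77"), len([1, 2, 3]))  # 1 3
```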
