python - Python unicode indexing shows a different character

Question

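I have a Unicode string containing a non-BMP Unicode character in a "narrow" build of Python 2.7.10. I'm trying to use that character as a lookup key in a dictionary, but when I index the string to get the last Unicode character, it returns a different string: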

>>> s = u'Python is fun \U0001f44d'
>>> s[-1]
u'\udc4d'

Why does this happen, and how do I retrieve '\U0001f44d' from the string?

Edit: unicodedata.unidata_version is 5.2.0 and sys.maxunicode is 65535.

score 3 · Accepted Answer

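It looks like your Python 2 build uses surrogates to represent code points outside the Basic Multilingual Plane. See, for example, "How to work with surrogate pairs in Python?" for some background.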
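To make the surrogate mechanism concrete, here is a minimal sketch of the standard UTF-16 split for a code point above U+FFFF. The helper name to_surrogate_pair is made up for illustration and is not part of any library. It also prints sys.maxunicode, which is 65535 (0xFFFF) on a narrow build such as the one in the question and 1114111 (0x10FFFF) on wide builds and on Python 3.3+.

```python
import sys

def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (leading, trailing) surrogates.

    Hypothetical helper for illustration only.
    """
    offset = code_point - 0x10000
    leading = 0xD800 + (offset >> 10)      # top 10 bits of the offset
    trailing = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return leading, trailing

lead, trail = to_surrogate_pair(0x1F44D)
print('%#x %#x' % (lead, trail))  # 0xd83d 0xdc4d
print(sys.maxunicode)             # 65535 on a narrow build
```

The trailing value 0xdc4d is exactly the u'\udc4d' that s[-1] returned in the question.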

My advice is to switch to Python 3 as soon as possible for anything that involves string processing.

score 2 · Accepted Answer

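Python 2 "narrow" builds store Unicode strings as UTF-16 (a so-called leaky abstraction), so a code point above U+FFFF is stored as two UTF-16 surrogates. To retrieve the code point, you must take both the leading and the trailing surrogate: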

Python 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:25:58) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> s = u'Python is fun \U0001f44d'
>>> s[-1]     # Just the trailing surrogate
u'\udc4d'
>>> s[-2:]    # leading and trailing
u'\U0001f44d'
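The s[-2:] slice works when you already know the last code point is a surrogate pair. For the general case, a minimal sketch is a helper that checks for a leading/trailing pair before deciding how many code units to take. The names last_code_point and labels are hypothetical, chosen for this example. The code is written so that it should behave the same on a narrow Python 2 build and on Python 3.

```python
def last_code_point(s):
    """Return the final code point of s as a string of 1 or 2 code units.

    Hypothetical helper for illustration only.
    """
    if (len(s) >= 2
            and u'\udc00' <= s[-1] <= u'\udfff'    # trailing (low) surrogate
            and u'\ud800' <= s[-2] <= u'\udbff'):  # leading (high) surrogate
        return s[-2:]
    return s[-1:]

s = u'Python is fun \U0001f44d'
labels = {u'\U0001f44d': 'thumbs up'}  # hypothetical lookup table

print(labels[last_code_point(s)])  # thumbs up
```

On a narrow build the helper returns the two-unit slice, which compares equal to the u'\U0001f44d' dictionary key; on Python 3 the last element is already the whole code point, so the one-element slice is returned.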

Switch to Python 3.3+, where the problem is solved and the storage details of Unicode code points in Unicode strings are not exposed:

Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> s = u'Python is fun \U0001f44d'
>>> s[-1]   # code points are stored in Unicode strings.
'\U0001f44d'
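As a short sketch of what this means for the dictionary lookup in the question, assuming Python 3.3 or later, plain indexing is enough. The labels dictionary is a hypothetical lookup table used only for this example.

```python
s = 'Python is fun \U0001f44d'
labels = {'\U0001f44d': 'thumbs up'}  # hypothetical lookup table

last = s[-1]            # one element per code point on Python 3.3+
print(hex(ord(last)))   # 0x1f44d
print(labels[last])     # thumbs up
```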
