python - Equivalent of string.ascii_letters for unicode strings in python 2.x?

Question

In the standard library's "string" module,

string.ascii_letters ## Same as string.ascii_lowercase + string.ascii_uppercase

is

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

Is there a similar constant that would include everything considered a letter in unicode?

score 11 · Accepted Answer

You can construct your own constant of Unicode upper- and lowercase letters with:

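import unicodedata as ud
# Every code point in the Basic Multilingual Plane (Python 2: unichr, xrange)
all_unicode = u''.join(unichr(i) for i in xrange(65536))
# Keep only uppercase (Lu) and lowercase (Ll) letters
unicode_letters = u''.join(c for c in all_unicode
                           if ud.category(c) in ('Lu', 'Ll'))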

This yields a string 2153 characters long (narrow Unicode Python build). For code like letter in unicode_letters, a set is faster:

unicode_letters = set(unicode_letters)
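The same idea can be sketched as a small helper that also accepts the other letter categories (Lt, Lm, Lo), in case "everything considered a letter" should include uncased scripts. This is only a sketch: the names build_letter_set, cased_letters and all_letters are made up for illustration, the unichr fallback is an assumption added so that the snippet also runs on Python 3, and the scan is limited to the Basic Multilingual Plane, as in the answer above.

```python
import unicodedata as ud

try:
    unichr          # Python 2 built-in
except NameError:
    unichr = chr    # Python 3: chr() already returns a Unicode character


def build_letter_set(categories=('Lu', 'Ll'), limit=0x10000):
    """Return a frozenset of the characters below `limit` whose Unicode
    general category is one of `categories` (hypothetical helper)."""
    return frozenset(
        c for c in (unichr(i) for i in range(limit))
        if ud.category(c) in categories
    )


# Cased letters only, as in the answer above.
cased_letters = build_letter_set()

# All five letter categories: uppercase, lowercase, titlecase, modifier, other.
all_letters = build_letter_set(('Lu', 'Ll', 'Lt', 'Lm', 'Lo'))
```

With these sets, u'\u0444' in cased_letters is a constant-time membership test, and a CJK ideograph such as u'\u4e2d' (category Lo) is found only in all_letters.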

score 7 · Accepted Answer

There's no such string, but you can check whether a character is a letter using the unicodedata module, in particular its category() function.

>>> import unicodedata
>>> unicodedata.category(u'a')
'Ll'
>>> unicodedata.category(u'A')
'Lu'
>>> unicodedata.category(u'5')
'Nd'
>>> unicodedata.category(u'ф') # Cyrillic f.
'Ll'
>>> unicodedata.category(u'٢') # Arabic-indic numeral for 2.
'Nd'

Ll means "Letter, lowercase". Lu means "Letter, uppercase". Nd means "Number, decimal digit".
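Since every letter category starts with "L", the check can be wrapped in a small predicate. A minimal sketch, assuming that any L* category should count as a letter; is_letter and letters_only are hypothetical names, not part of unicodedata:

```python
import unicodedata


def is_letter(ch):
    """Return True if the single character `ch` is in a letter category
    (Lu, Ll, Lt, Lm or Lo)."""
    return unicodedata.category(ch).startswith('L')


def letters_only(text):
    """Return `text` with every non-letter character dropped."""
    return u''.join(ch for ch in text if is_letter(ch))
```

For example, letters_only(u'ab5\u0444\u0662') keeps the two Latin letters and the Cyrillic one and drops both digits.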

score 0 · Accepted Answer

That would be a pretty massive constant. Unicode currently covers more than 100,000 different characters. So the answer is no.

The question is why you would need it. There may be some other way of solving whatever your problem is, with the unicodedata module for example.

Update: You can download files with all Unicode code point names and other information from ftp://ftp.unicode.org/ and do a lot of interesting things with them.

score -1 · Accepted Answer

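As mentioned in the previous answers, the string would indeed be way too long. So you have to target (a) specific language(s).
[EDIT: I realized that this was the case for my original intended use, and I guess for most uses. However, in the meantime, Mark Tolonen gave a good answer to the question as it was asked, so I chose his answer, although I used the following solution]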

This is easily done with the "locale" module:

import locale
import string
code = 'fr_FR' ## Do NOT specify encoding (see below)
locale.setlocale(locale.LC_CTYPE, code)
encoding = locale.getlocale()[1]
letters = string.letters.decode(encoding)

where "letters" is a 117-character-long unicode string.

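Apparently, string.letters depends on the default encoding for the selected language code, rather than on the language itself. Setting the locale to fr_FR or de_DE or es_ES will update string.letters to the same value (since they are all encoded in ISO8859-1 by default).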

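If you add an encoding to the language code (de_DE.UTF-8), that encoding will be used for string.letters instead of the default one. That would cause a UnicodeDecodeError if you used the rest of the code above.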
