python - Where is Python's "best ASCII for this Unicode" database?

Question

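I have some text that uses Unicode punctuation, such as left double quotes, right single quotes used as apostrophes, and so on, and I need it in ASCII. Does Python have a database of these characters with obvious ASCII substitutes, so I can do better than turning them all into "?"?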

score 90 · Accepted Answer

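Unidecode looks like a complete solution. It converts fancy quotes to ASCII quotes and accented Latin characters to unaccented ones, and it even attempts transliteration for characters that have no ASCII equivalent. That way your users don't have to see a bunch of ? when you have to pass their text through a legacy 7-bit ASCII system.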

>>> from unidecode import unidecode
>>> print(unidecode(u"\u5317\u4EB0"))
Bei Jing

http://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliteration_in_python/

score 25 · Accepted Answer

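In my original answer, I also suggested unicodedata.normalize. However, I decided to test it, and it turns out it doesn't work on Unicode quotation marks. It does a good job of translating accented Unicode characters, so I'm guessing unicodedata.normalize is implemented using the unicodedata.decomposition function. That leads me to believe it can probably only handle Unicode characters that are combinations of a letter and a diacritical mark, but I'm not really an expert on the Unicode specification, so I could just be full of hot air...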
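The behaviour described above can be checked directly. This is a minimal sketch, assuming Python 3 (where str is already Unicode); the helper name nfkd_ascii is made up for illustration and is not part of any library:

```python
import unicodedata

def nfkd_ascii(text):
    """Decompose with NFKD, then drop whatever is still non-ASCII (hypothetical helper)."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

# An accented letter has a decomposition mapping: base letter + combining mark.
accented = unicodedata.decomposition('\u00e9')   # LATIN SMALL LETTER E WITH ACUTE
# A curly quote has no decomposition mapping, so normalization leaves it alone.
quote = unicodedata.decomposition('\u201c')      # LEFT DOUBLE QUOTATION MARK

print(repr(accented))                    # '0065 0301'
print(repr(quote))                       # ''
print(nfkd_ascii('caf\u00e9'))           # cafe
print(nfkd_ascii('\u201chello\u201d'))   # hello  (the quotes are silently dropped)
```

The last line shows why normalization alone is not enough here: the quotation marks survive NFKD unchanged and are then discarded by the 'ignore' error handler.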

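In any case, you can use unicode.translate (str.translate in Python 3) to deal with punctuation characters instead. The translate method takes a dictionary mapping Unicode ordinals to Unicode ordinals, so you can create a mapping that translates Unicode-only punctuation to ASCII-compatible punctuation: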

# Maps left and right single and double quotation marks
# into ASCII single and double quotation marks
>>> punctuation = { 0x2018:0x27, 0x2019:0x27, 0x201C:0x22, 0x201D:0x22 }
>>> teststring = u'\u201Chello, world!\u201D'
>>> teststring.translate(punctuation).encode('ascii', 'ignore')
'"hello, world!"'

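You can add more punctuation mappings if needed, but I don't think you need to worry about handling every single Unicode punctuation character. If you do need to handle accents and other diacritical marks, you can still use unicodedata.normalize to deal with those characters.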
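Putting the two steps together might look like the following. This is only a sketch, assuming Python 3; the names PUNCTUATION and to_ascii are hypothetical, and the mapping covers just the four quotation marks from the example above:

```python
import unicodedata

# Hypothetical mapping; extend it with whatever punctuation your text contains.
PUNCTUATION = {
    0x2018: 0x27,  # left single quotation mark  -> '
    0x2019: 0x27,  # right single quotation mark -> '
    0x201C: 0x22,  # left double quotation mark  -> "
    0x201D: 0x22,  # right double quotation mark -> "
}

def to_ascii(text):
    """Map known punctuation, strip diacritics, drop anything else non-ASCII."""
    text = text.translate(PUNCTUATION)           # curly quotes -> ASCII quotes
    text = unicodedata.normalize('NFKD', text)   # split letters from their accents
    return text.encode('ascii', 'ignore').decode('ascii')

print(to_ascii('\u201cCaf\u00e9 isn\u2019t closed\u201d'))   # "Cafe isn't closed"
```

Translating before normalizing keeps the quotation marks from being thrown away by the 'ignore' error handler; characters covered by neither step are still dropped.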

score 21 · Accepted Answer

Interesting question.

Google helped me find this page, which describes using the unicodedata module as follows:

import unicodedata
# title is the Unicode string to convert
unicodedata.normalize('NFKD', title).encode('ascii','ignore')
