python - Python and character normalization

Question

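Hello, I retrieve text-based UTF-8 data from an external source, and it contains special characters such as u"ıöüç". I want to normalize them to plain English letters, e.g. "ıöüç" -> "iouc". What is the best way to achieve this?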

score 43 · Accepted Answer

>>> from unidecode import unidecode
>>> unidecode(u'ıöüç')
'iouc'

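Note that you give it a unicode string and it outputs a byte string (that is the Python 2 behavior; on Python 3 the result is an ordinary str). The output is guaranteed to be ASCII.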

score 6 · Accepted Answer

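It all depends on how far you want to go in transliterating the result. If you want to convert everything all the way to ASCII (αβγ to abg), then unidecode is the way to go.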

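If you only want to remove accents from accented letters, you could try decomposing your string with normalization form NFKD (this converts the accented letter á into a plain letter a followed by U+0301 COMBINING ACUTE ACCENT) and then discarding the accents (which belong to the Unicode character class Mn, "Mark, Nonspacing").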

import unicodedata

def remove_nonspacing_marks(s):
    "Decompose the unicode string s and remove non-spacing marks."
    return ''.join(c for c in unicodedata.normalize('NFKD', s)
                   if unicodedata.category(c) != 'Mn')
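
As a quick check of what this approach does and does not cover, here is a minimal sketch that repeats the function above and runs it on a few inputs (the variable names are only for illustration). It suggests that letters without a Unicode decomposition, such as the dotless ı from the question, pass through unchanged, so this method alone would not give "iouc".

```python
import unicodedata

def remove_nonspacing_marks(s):
    """Decompose s with NFKD and drop characters in category Mn."""
    return ''.join(c for c in unicodedata.normalize('NFKD', s)
                   if unicodedata.category(c) != 'Mn')

# Each of these letters decomposes into a base letter plus a combining mark.
accented = remove_nonspacing_marks('öüç')

# U+0131 LATIN SMALL LETTER DOTLESS I has no decomposition, so it is kept as-is.
dotless = remove_nonspacing_marks('ı')

# Non-Latin letters are not transliterated; only marks are removed.
greek = remove_nonspacing_marks('αβγ')

print(accented, dotless, greek)
```

If the remaining non-ASCII letters also need to be mapped, unidecode (or an explicit replacement table) would have to handle them.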

score 2 · Accepted Answer

The simplest way I've found:

unicodedata.normalize('NFKD', s).encode("ascii", "ignore")
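
On Python 3 the one-liner above returns bytes, so a decode step is needed to get text back. The sketch below wraps it in a helper; the name to_ascii is hypothetical and not part of any library. Note that "ignore" silently drops every character that has no ASCII form after decomposition, which includes the dotless ı from the question.

```python
import unicodedata

def to_ascii(s):
    """NFKD-decompose s, drop every non-ASCII code point, and return text."""
    decomposed = unicodedata.normalize('NFKD', s)
    # encode(..., 'ignore') discards combining marks and any other non-ASCII
    # character; decode turns the resulting bytes back into a str.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

result = to_ascii('ıöüç')
print(result)
```

Because ı is dropped rather than converted, the result here is shorter than the input.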

answered 2017-04-12T20:54:04.683

score 0 · Accepted Answer

import unicodedata
# The form argument is one of 'NFC', 'NFKC', 'NFD', 'NFKD'.
unicodedata.normalize('NFKD', u'ıöüç')

http://docs.python.org/library/unicodedata.html
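
For readers unsure which form to pass, here is a small sketch of how the forms differ, using only the standard unicodedata module. The sample characters are my own choice for illustration.

```python
import unicodedata

composed = '\u00e1'  # á as a single code point

# NFD splits it into 'a' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = unicodedata.normalize('NFD', composed)

# NFC recombines the pair into the single code point again.
recomposed = unicodedata.normalize('NFC', decomposed)

# The "K" (compatibility) forms also expand characters such as the
# fi ligature U+FB01, which the canonical forms leave alone.
ligature = '\ufb01'
canonical = unicodedata.normalize('NFD', ligature)
compat = unicodedata.normalize('NFKD', ligature)

print(len(composed), len(decomposed), compat)
```
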

answered 2010-11-12T08:05:52.973
