python - 组合变音符号不能用 unicodedata.normalize (PYTHON) 规范化

Question

我知道unicodedata.normalize将变音符号转换为非变音符号：

import unicodedata
''.join( c for c in unicodedata.normalize('NFD', u'B\u0153uf') 
            if unicodedata.category(c) != 'Mn'
       )

我的问题是（可以在这个例子中看到）： unicodedata 是否有办法将组合的 char 变音符号替换为对应的字符？(u'œ' 变成 'oe')

如果不是，我认为我将不得不对这些进行打击，但是我不妨与所有 uchars 及其对应物一起编译我自己的 dict 并unicodedata完全忘记......

score 6 · Accepted Answer

您的问题中的术语有些混乱。变音符号是可以添加到字母或其他字符但通常不独立存在的标记。（Unicode 也使用更通用的术语组合字符。）normalize('NFD', ...)所做的是将预先组合的字符转换为其组件。

无论如何，答案是 – 不是预先组合的字符。这是一个印刷连字：

>>> unicodedata.name(u'\u0153')
'LATIN SMALL LIGATURE OE'

该unicodedata模块没有提供将连字分成其部分的方法。但数据存在于角色名称中：

import re
import unicodedata

_ligature_re = re.compile(r'LATIN (?:(CAPITAL)|SMALL) LIGATURE ([A-Z]{2,})')

def split_ligatures(s):
    """
    Split the ligatures in `s` into their component letters. 
    """
    def untie(l):
        m = _ligature_re.match(unicodedata.name(l))
        if not m: return l
        elif m.group(1): return m.group(2)
        else: return m.group(2).lower()
    return ''.join(untie(l) for l in s)

>>> split_ligatures(u'B\u0153uf \u0132sselmeer \uFB00otogra\uFB00')
u'Boeuf IJsselmeer ffotograff'

（当然，在实践中您不会这样做：您会按照您在问题中的建议对 Unicode 数据库进行预处理以生成查找表。Unicode 中并没有那么多连字。）

python - 组合变音符号不能用 unicodedata.normalize (PYTHON) 规范化

1 回答 1

Related

Reference