python-3.x - Rule to go from an accented letter to an ASCII letter

Question

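Is there a rule that helps find the UTF-8 codes of all the accented letters associated with an ASCII letter? For example, can I get the UTF-8 codes of all the accented letters é, è, ... from the UTF-8 code of the letter e?

Here is a showcase in Python 3 using the solution given by Ramchandra Apte.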

import unicodedata

def accented_letters(letter):
    """Return the accented variants of `letter` that exist under a Unicode name."""
    accented_chars = []

    for accent_type in "acute", "double acute", "grave", "double grave":
        try:
            # unicodedata.lookup raises KeyError when no character has this name.
            accented_chars.append(
                unicodedata.lookup(
                    f"Latin small letter {letter} with {accent_type}"
                )
            )

        except KeyError:
            pass

    return accented_chars

print(accented_letters("e"))


for kind in ["NFC", "NFKC", "NFD", "NFKD"]:
    print(
        '---',
        kind,
        list(unicodedata.normalize(kind,"é")),
        sep = "\n"
    )

Find characters that are similar glyphically in Unicode?

for oneChar in "βεέ.¡¿?ê":
    print(
        '---',
        oneChar,
        unicodedata.name(oneChar),
        # Decompose, then drop every code point that is not ASCII.
        unicodedata.normalize('NFD', oneChar).encode('ascii','ignore'),
        sep = "\n"
    )

The corresponding output:

['é', 'è', 'ȅ']
---
NFC
['é']
---
NFKC
['é']
---
NFD
['e', '́']
---
NFKD
['e', '́']
---
β
GREEK SMALL LETTER BETA
b''
---
ε
GREEK SMALL LETTER EPSILON
b''
---
έ
GREEK SMALL LETTER EPSILON WITH TONOS
b''
---
.
FULL STOP
b'.'
---
¡
INVERTED EXCLAMATION MARK
b''
---
¿
INVERTED QUESTION MARK
b''
---
?
QUESTION MARK
b'?'
---
ê
LATIN SMALL LETTER E WITH CIRCUMFLEX
b'e'

Technical information about UTF-8 (reference given by cjc343)

https://www.rfc-editor.org/rfc/rfc3629
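UTF-8 only describes how a code point is turned into bytes, so the link between é and e is not visible at the byte level. A small check, as a sketch (the variable name utf8_bytes is just for illustration):

```python
# Map each character to its UTF-8 byte sequence.
utf8_bytes = {char: char.encode("utf-8") for char in "e\u00e9\u00e8\u00ea"}

for char, encoded in utf8_bytes.items():
    # Show the character, its code point, and its UTF-8 bytes in hex.
    print(char, f"U+{ord(char):04X}", encoded.hex(" "))
```

The plain e is the single byte 65, while é, è and ê are the two-byte sequences c3 a9, c3 a8 and c3 aa. The relationship to e comes from the Unicode character data (via unicodedata), not from the encoding.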

score 1 · Accepted Answer

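In many languages these are generally considered distinct characters. However, if you really need this, you will need a function that normalizes strings. In this case, you want to normalize to the decomposed form, in which each such character becomes two Unicode code points in the string: the base letter followed by a combining accent.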
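As a minimal sketch of what decomposition means here, assuming Python's standard unicodedata module is the normalization function in question:

```python
import unicodedata

composed = "\u00e9"  # é as a single precomposed code point
decomposed = unicodedata.normalize("NFD", composed)

# After NFD the string holds the base letter followed by a combining mark.
code_points = [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in decomposed]
print(len(composed), len(decomposed))
print(code_points)

# NFC recomposes the pair into the original single code point.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == composed)
```

The two code points printed are U+0065 LATIN SMALL LETTER E and U+0301 COMBINING ACUTE ACCENT.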

score 0 · Accepted Answer

Use unicodedata.lookup:

import unicodedata

def accented_letters(letter):
    accented_chars = []
    for accent_type in "acute", "double acute", "grave", "double grave":
        try:
            # unicodedata.lookup raises KeyError when no character has this name.
            accented_chars.append(unicodedata.lookup(f"Latin small letter {letter} with {accent_type}"))
        except KeyError:
            pass
    return accented_chars

print(accented_letters("e"))
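The lookup above only tries four accent names. As a broader sketch, one could instead scan a range of code points and keep those whose NFD form is the letter followed only by combining marks. The function name accented_variants and the limit parameter are hypothetical, chosen for illustration, and the cut-off at 0x3000 is an arbitrary assumption to keep the scan short:

```python
import unicodedata

def accented_variants(letter, limit=0x3000):
    """Collect characters below `limit` whose NFD form is `letter` plus combining marks."""
    variants = []
    for code_point in range(limit):
        char = chr(code_point)
        decomposed = unicodedata.normalize("NFD", char)
        if (
            len(decomposed) > 1
            and decomposed[0] == letter
            # combining() is non-zero only for combining characters.
            and all(unicodedata.combining(c) for c in decomposed[1:])
        ):
            variants.append(char)
    return variants

variants = accented_variants("e")
print(variants)
```

The exact list depends on the Unicode version shipped with the interpreter, so treat the result as data to inspect rather than a fixed set.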

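To go the other way, you can use unicodedata.normalize with the NFD form and take the first character, since the second character is the accent in its combining form.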

print(unicodedata.normalize("NFD","è")[0]) # Prints "e".
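Taking index [0] works for a single character. For a whole string, a possible sketch is to decompose and then drop every combining mark; strip_accents is a hypothetical helper name, not part of unicodedata:

```python
import unicodedata

def strip_accents(text):
    """Return `text` with combining marks removed after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for non-combining characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("\u00e9l\u00e8ve"))  # input is "élève"
print(strip_accents("\u00f8"))           # "ø" has no decomposition
```

Letters such as ø or ß have no canonical decomposition, so this approach leaves them unchanged; mapping those to ASCII would need a separate table.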
