python - Combine Devanagari characters

Question

I have something like

a = "बिक्रम मेरो नाम हो"

I want to achieve something like

a[0] = बि
a[1] = क्र
a[3] = म

But since म takes 4 bytes while बि takes 8 bytes, I can't get at them directly. What can be done to achieve this in Python?

score 24 · Accepted Answer

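The algorithm for splitting text into grapheme clusters is given in Unicode Annex 29, section 3.1. I'm not going to implement the full algorithm for you here, but I'll show you roughly how to handle the case of Devanagari, and then you can read the Annex for yourself and see what else you need to implement.

The unicodedata module contains the information you need to detect the grapheme clusters.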

>>> import unicodedata
>>> a = "बिक्रम मेरो नाम हो"
>>> [unicodedata.name(c) for c in a]
['DEVANAGARI LETTER BA', 'DEVANAGARI VOWEL SIGN I', 'DEVANAGARI LETTER KA', 
 'DEVANAGARI SIGN VIRAMA', 'DEVANAGARI LETTER RA', 'DEVANAGARI LETTER MA',
 'SPACE', 'DEVANAGARI LETTER MA', 'DEVANAGARI VOWEL SIGN E',
 'DEVANAGARI LETTER RA', 'DEVANAGARI VOWEL SIGN O', 'SPACE',
 'DEVANAGARI LETTER NA', 'DEVANAGARI VOWEL SIGN AA', 'DEVANAGARI LETTER MA',
 'SPACE', 'DEVANAGARI LETTER HA', 'DEVANAGARI VOWEL SIGN O']

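In Devanagari, each grapheme cluster consists of an initial letter, optional pairs of virama (vowel killer) and letter, and an optional vowel sign. In regular expression notation that would be LETTER (VIRAMA LETTER)* VOWEL?. You can tell which is which by looking up the Unicode category for each code point: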

>>> [unicodedata.category(c) for c in a]
['Lo', 'Mc', 'Lo', 'Mn', 'Lo', 'Lo', 'Zs', 'Lo', 'Mn', 'Lo', 'Mc', 'Zs',
 'Lo', 'Mc', 'Lo', 'Zs', 'Lo', 'Mc']

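Letters are category Lo (Letter, Other); vowel signs are category Mc (Mark, Spacing Combining) or, as with VOWEL SIGN E above, Mn (Mark, Nonspacing); virama is also category Mn; and spaces are category Zs (Separator, Space).

So here's a rough approach to split the grapheme clusters: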

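import unicodedata

def splitclusters(s):
    """Generate the grapheme clusters for the string s. (Not the full
    Unicode text segmentation algorithm, but probably good enough for
    Devanagari.)

    """
    virama = u'\N{DEVANAGARI SIGN VIRAMA}'
    cluster = u''
    last = None
    for c in s:
        # Major category only: 'L' (letter), 'M' (mark), 'Z' (separator), ...
        cat = unicodedata.category(c)[0]
        # A mark always joins the current cluster; a letter joins only
        # when it directly follows a virama (forming a conjunct).
        if cat == 'M' or (cat == 'L' and last == virama):
            cluster += c
        else:
            if cluster:
                yield cluster
            cluster = c
        last = c
    if cluster:
        yield cluster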

>>> list(splitclusters(a))
['बि', 'क्र', 'म', ' ', 'मे', 'रो', ' ', 'ना', 'म', ' ', 'हो']
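
The regular expression notation mentioned above can also be applied fairly literally. The following is only a sketch under the same simplifying assumptions as splitclusters: it maps each code point to a one-letter class and runs an ordinary re pattern over that "shadow" string. The names classify, CLUSTER and splitclusters_re are made up for this illustration, and the pattern is a slight generalisation (any number of trailing marks, including a final virama), not the Annex 29 rules.

```python
import re
import unicodedata

VIRAMA = "\N{DEVANAGARI SIGN VIRAMA}"

def classify(c):
    """Map a code point to a one-letter class used by the pattern below."""
    if c == VIRAMA:
        return "V"
    major = unicodedata.category(c)[0]  # 'L', 'M', 'Z', ...
    if major in ("L", "M"):
        return major
    return "O"  # anything else: spaces, digits, punctuation

# LETTER (VIRAMA LETTER)* followed by any marks; otherwise one code point.
CLUSTER = re.compile(r"L(?:VL)*[MV]*|.")

def splitclusters_re(s):
    """Split s into clusters by matching CLUSTER against its class string."""
    shadow = "".join(classify(c) for c in s)
    # The shadow string has one character per code point, so match
    # offsets can be used directly as slice offsets into s.
    return [s[m.start():m.end()] for m in CLUSTER.finditer(shadow)]

a = "बिक्रम मेरो नाम हो"
print(splitclusters_re(a))
```

For the sample string this should produce the same list as splitclusters above; for other scripts or unusual sequences the two may differ from a full Annex 29 implementation.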

score 15 · Accepted Answer

So, you want to achieve something like this

a[0] = बि a[1] = क्र a[3] = म

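My advice is to ditch the idea that string indexing corresponds to the characters you see on the screen. Devanagari, as well as several other scripts, does not play well with programmers who grew up with Latin characters. I suggest reading chapter 9 of the Unicode Standard.

It looks like what you are trying to do is break a string into grapheme clusters. String indexing by itself will not let you do this. Hangul is another script that plays poorly with string indexing, although with combining characters, even something as familiar as Spanish can cause problems.

You will need an external library such as ICU to achieve this (unless you have lots of free time). ICU has Python bindings.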

>>> a = u"बिक्रम मेरो नाम हो"
>>> import icu
    # Note: This next line took a lot of guesswork.  The C, C++, and Java
    # interfaces have better documentation.
>>> b = icu.BreakIterator.createCharacterInstance(icu.Locale())
>>> b.setText(a)
>>> i = 0
>>> for j in b:
...     s = a[i:j]
...     print('|', s, len(s))
...     i = j
... 
| बि 2
| क् 2
| र 1
| म 1
|   1
| मे 2
| रो 2
|   1
| ना 2
| म 1
|   1
| हो 2

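Note how some of these "characters" (grapheme clusters) have length 2 and some have length 1. This is why string indexing is problematic: if I want to get grapheme cluster #69450 from a text file, I have to scan linearly through the entire file and count. So your options are:

Build an index (kind of crazy...)
Just accept that you can't break at every character boundary. The break iterator object can move both forwards and backwards, so if you need to extract the first 140 characters of a string, look at index 140 and iterate backwards to the previous grapheme cluster break; that way you don't end up with funny-looking text. (Better yet, use a word break iterator for the appropriate locale.) The benefit of working at this level of abstraction (character iterators and the like) is that it no longer matters which encoding you use: UTF-8, UTF-16 or UTF-32, it all just works. Well, mostly works.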
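
Both options can be sketched with the standard library alone. This is a minimal illustration, not ICU: it assumes a rough Devanagari-only rule (marks attach to the preceding code point, and a letter attaches when it follows a virama) in place of a real break iterator, and the helper names continues_cluster, cluster_starts and truncate are invented for the example.

```python
import unicodedata

VIRAMA = "\N{DEVANAGARI SIGN VIRAMA}"

def continues_cluster(s, i):
    """True if s[i] attaches to the cluster that contains s[i - 1]."""
    if i == 0:
        return False
    major = unicodedata.category(s[i])[0]
    return major == "M" or (major == "L" and s[i - 1] == VIRAMA)

def cluster_starts(s):
    """Option 1: an index of the offsets at which each cluster begins."""
    return [i for i in range(len(s)) if not continues_cluster(s, i)]

def truncate(s, limit):
    """Option 2: keep at most `limit` code points without splitting a cluster."""
    if len(s) <= limit:
        return s
    end = limit
    # Walk backwards until `end` sits on a cluster boundary.
    while end > 0 and continues_cluster(s, end):
        end -= 1
    return s[:end]

a = "बिक्रम मेरो नाम हो"
starts = cluster_starts(a)
print(starts)
print(a[starts[1]:starts[2]])  # second cluster, via the index
print(truncate(a, 4))          # offset 4 falls inside a cluster, so it backs up
```

With the index in hand, cluster number k is the slice between starts[k] and starts[k + 1], which avoids rescanning the text for every lookup at the cost of building and storing the list.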

score 3 · Accepted Answer

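You can achieve this with a simple regular expression in any engine that supports \X.

Unfortunately, Python's re module does not support \X grapheme matching.

Fortunately, the proposed replacement, regex, does support \X: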

>>> import regex
>>> a = "बिक्रम मेरो नाम हो"
>>> regex.findall(r'\X', a)
['बि', 'क्', 'र', 'म', ' ', 'मे', 'रो', ' ', 'ना', 'म', ' ', 'हो']
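
The output above splits क्र into क् and र, whereas the question asks for क्र as a single unit. If that is what you get, one possible post-processing step is to glue a cluster that ends in a virama onto the cluster that follows it. The sketch below is an assumption-laden illustration rather than part of regex or any other library: merge_conjuncts is a hypothetical helper, it works on any list of cluster strings, and it only knows about the Devanagari virama.

```python
import unicodedata

VIRAMA = "\N{DEVANAGARI SIGN VIRAMA}"

def merge_conjuncts(clusters):
    """Join each cluster ending in a virama with a following cluster
    that starts with a letter, so that conjuncts stay together."""
    merged = []
    for cluster in clusters:
        starts_with_letter = bool(cluster) and unicodedata.category(cluster[0]).startswith("L")
        if merged and merged[-1].endswith(VIRAMA) and starts_with_letter:
            merged[-1] += cluster
        else:
            merged.append(cluster)
    return merged

# Clusters as shown in the output above, typed in by hand for the example.
clusters = ['बि', 'क्', 'र', 'म', ' ', 'मे', 'रो', ' ', 'ना', 'म', ' ', 'हो']
print(merge_conjuncts(clusters))
```

Because the helper only appends to the previous element, a chain such as consonant, virama, consonant, virama, consonant also collapses into one cluster. If the segmentation library already keeps conjuncts together, the function leaves its output unchanged.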

score 1 · Accepted Answer

Indic and other non-Latin scripts like Hangul do not generally follow the idea of matching string indices to code points. It's generally a pain working with Indic scripts. Most characters are two bytes, with some rare ones extending into three. With Dravidian scripts, there is no defined order. See the Unicode specification for more details.

That said, check here for some ideas about Unicode and Python with C++.

Finally, as said by Dietrich, you might want to check out ICU too. It has bindings available for C/C++ and Java via icu4c and icu4j respectively. There's some learning curve involved, so I suggest you set aside ~~some~~ loads of time for it. :)
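
How many bytes a character takes depends on the encoding, which is easy to check directly. The snippet below is a small sketch (byte_sizes is a name invented here) that counts code points and encoded bytes for the two examples from the question. The 4 and 8 bytes mentioned there would be consistent with UTF-32, where every code point takes 4 bytes.

```python
def byte_sizes(text):
    """Return the encoded length of text in bytes for three encodings.
    The -le variants are used so that no byte order mark is counted."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

for text in ("म", "बि"):
    print(repr(text), "code points:", len(text), "bytes:", byte_sizes(text))
```

Note that none of these sizes says where a grapheme cluster starts or ends; they only describe how the individual code points are stored.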

score 1 · Accepted Answer

answered 2020-07-19T18:43:39.827

score 0 · Accepted Answer

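There is a pure-Python library called uniseg which provides a number of utilities, including a grapheme cluster iterator that gives the behavior you describe: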

>>> a = u"बिक्रम मेरो नाम हो"
>>> from uniseg.graphemecluster import grapheme_clusters
>>> for i in grapheme_clusters(a): print(i)
... 
बि
क्
र
म

मे
रो

ना
म

हो

It claims to implement the full Unicode text segmentation algorithm described in http://www.unicode.org/reports/tr29/tr29-21.html.
