python - Separating vowels and consonants in Indic/abugida scripts using Python

Asked 2017-05-17T12:57:19.800

274 views

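I am trying to build a program that helps me convert text in a Unicode abugida script into a list of vowels and consonants. I have already managed to split the text into its phonetic clusters using the following script, taken from "Playing around with Devanagari characters":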

#!/usr/bin/python
# -*- coding: utf-8 -*-

import unicodedata, sys

def splitclusters(s):
    """Generate the grapheme clusters for the string s. (Not the full
    Unicode text segmentation algorithm, but probably good enough for
    Devanagari.)

    """
    virama = u'\N{DEVANAGARI SIGN VIRAMA}'
    cluster = u''
    last = None
    for c in s:
        cat = unicodedata.category(c)[0]
        # Marks always join the current cluster; letters join only after a virama.
        if cat == 'M' or (cat == 'L' and last == virama):
            cluster += c
        else:
            if cluster:
                yield cluster
            cluster = c
        last = c
    if cluster:
        yield cluster

name_in_indic = raw_input('Enter your name in devanagari: ').decode('utf8')

print (','.join(list(splitclusters(name_in_indic))))

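However, my intention is to go one step further and separate out all the vowels and consonants.

E.g. हिंदी = ह+इ+न+द+ई

This is the same as "Hindi" becoming h+i+n+d+i, except that it stays in the Indic script, with each phoneme treated as a single character.

How can I do this?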
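
For illustration, here is a minimal sketch of one possible approach, written for Python 3 (the script above targets Python 2). It relies on the Unicode naming convention that each dependent vowel sign named `DEVANAGARI VOWEL SIGN X` usually has an independent counterpart named `DEVANAGARI LETTER X`. The function names `split_phonemes`, `independent_vowel` and `is_consonant`, the `inherent` parameter, the hard-coded consonant ranges, and the mapping of anusvara to न (chosen only to match the example above) are all assumptions of this sketch, not an established algorithm.

```python
import unicodedata

VIRAMA = "\N{DEVANAGARI SIGN VIRAMA}"
NUKTA = "\N{DEVANAGARI SIGN NUKTA}"
ANUSVARA = "\N{DEVANAGARI SIGN ANUSVARA}"
INHERENT_VOWEL = "\N{DEVANAGARI LETTER A}"
VOWEL_SIGN_PREFIX = "DEVANAGARI VOWEL SIGN "


def is_consonant(c):
    # Assumption: basic consonants KA..HA plus the precomposed nukta forms QA..YYA.
    return "\u0915" <= c <= "\u0939" or "\u0958" <= c <= "\u095f"


def independent_vowel(c):
    """Return the independent vowel for a dependent vowel sign, else None."""
    name = unicodedata.name(c, "")
    if not name.startswith(VOWEL_SIGN_PREFIX):
        return None
    try:
        return unicodedata.lookup("DEVANAGARI LETTER " + name[len(VOWEL_SIGN_PREFIX):])
    except KeyError:
        # No independent counterpart exists; keep the sign unchanged.
        return c


def split_phonemes(s, inherent=True):
    """Split Devanagari text into a flat list of consonants and vowels."""
    out = []
    pending = False  # True while the last consonant has not received a vowel yet
    for c in s:
        if c == NUKTA and pending:
            out[-1] += c  # keep the nukta attached to its consonant
            continue
        vowel = independent_vowel(c)
        # A consonant not followed by a vowel sign or virama carries the inherent vowel.
        if pending and vowel is None and c != VIRAMA and inherent:
            out.append(INHERENT_VOWEL)
        pending = False
        if is_consonant(c):
            out.append(c)
            pending = True
        elif vowel is not None:
            out.append(vowel)
        elif c == VIRAMA:
            continue  # the virama only suppresses the inherent vowel
        elif c == ANUSVARA:
            # Assumption: render anusvara as NA, as in the example above.
            out.append("\N{DEVANAGARI LETTER NA}")
        else:
            out.append(c)  # independent vowels and everything else pass through
    if pending and inherent:
        out.append(INHERENT_VOWEL)
    return out


print("+".join(split_phonemes("\u0939\u093f\u0902\u0926\u0940")))
```

With the example word this prints ह+इ+न+द+ई. Note that a consonant with no vowel sign gets an explicit अ appended, which ignores Hindi schwa deletion; passing `inherent=False` drops those inherent vowels instead. Other Indic scripts would need their own code point ranges and name prefixes.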

0 answers