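I am trying to build a program that converts text in a Unicode abugida script into a list of vowels and consonants. I have already managed to split the text into clusters using the following script, taken from "Playing around with Devanagari characters":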

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import unicodedata

def splitclusters(s):
    """Generate the grapheme clusters for the string s. (Not the full
    Unicode text segmentation algorithm, but probably good enough for
    Devanagari.)
    """
    virama = '\N{DEVANAGARI SIGN VIRAMA}'
    cluster = ''
    last = None
    for c in s:
        cat = unicodedata.category(c)[0]
        # Marks always join the current cluster; a letter joins only
        # when it directly follows a virama (conjunct consonant).
        if cat == 'M' or (cat == 'L' and last == virama):
            cluster += c
        else:
            if cluster:
                yield cluster
            cluster = c
        last = c
    if cluster:
        yield cluster

name_in_indic = input('Enter your name in devanagari: ')
print(','.join(splitclusters(name_in_indic)))

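However, I want to go one step further and separate out all the vowels and consonants.
E.g. हिंदी = ह+इ+न+द+ई
This is the same as "Hindi" becoming h+i+n+d+i, except that in the Indic script each phoneme is written as a single character.
How can I do this?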
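
To make the goal concrete, here is a minimal sketch of the kind of mapping I have in mind. It is not a complete solution. The function name `split_phonemes` and the lookup table are my own placeholders. It covers only ten common vowel signs. It assumes that anusvara can always be written as न (which happens to fit the example above but is not true in general), and it does not insert the inherent vowel अ after a bare consonant.

```python
import unicodedata

# Hand-picked vowels that have both a dependent sign (matra) and an
# independent letter in the Devanagari block. This list is an
# assumption for illustration and is not exhaustive.
_VOWEL_NAMES = ['AA', 'I', 'II', 'U', 'UU', 'VOCALIC R', 'E', 'AI', 'O', 'AU']

# Map each dependent vowel sign to its independent vowel letter.
SIGN_TO_LETTER = {
    unicodedata.lookup('DEVANAGARI VOWEL SIGN ' + name):
        unicodedata.lookup('DEVANAGARI LETTER ' + name)
    for name in _VOWEL_NAMES
}

VIRAMA = '\N{DEVANAGARI SIGN VIRAMA}'
ANUSVARA = '\N{DEVANAGARI SIGN ANUSVARA}'
NA = '\N{DEVANAGARI LETTER NA}'


def split_phonemes(s):
    """Return a list of independent vowels and consonants for s.

    Hypothetical helper: vowel signs become independent vowels,
    anusvara becomes NA (a simplifying assumption), virama is dropped,
    and every other character is passed through unchanged.
    """
    out = []
    for c in s:
        if c in SIGN_TO_LETTER:
            out.append(SIGN_TO_LETTER[c])
        elif c == ANUSVARA:
            out.append(NA)  # assumption: real nasal depends on context
        elif c == VIRAMA:
            continue        # virama only suppresses the inherent vowel
        else:
            out.append(c)
    return out
```

With this, `'+'.join(split_phonemes('हिंदी'))` gives ह+इ+न+द+ई. Other marks such as nukta, candrabindu, and visarga are passed through untouched, so I am looking for a more general approach.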