Check out the unicodedata module.
>>> import unicodedata
>>> word = 'कुरुक्षेत्र'
The name assigned to each character:
>>> for ch in word:
...     print(unicodedata.name(ch))
...
DEVANAGARI LETTER KA
DEVANAGARI VOWEL SIGN U
DEVANAGARI LETTER RA
DEVANAGARI VOWEL SIGN U
DEVANAGARI LETTER KA
DEVANAGARI SIGN VIRAMA
DEVANAGARI LETTER SSA
DEVANAGARI VOWEL SIGN E
DEVANAGARI LETTER TA
DEVANAGARI SIGN VIRAMA
DEVANAGARI LETTER RA
The general category assigned to each character:
>>> for ch in word:
...     print(unicodedata.category(ch))
...
Lo
Mn
Lo
Mn
Lo
Mn
Lo
Mn
Lo
Mn
Lo
FileFormat.info has a list of the Unicode character categories.
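It can be easier to read the two listings above side by side. Here is a minimal sketch that does so; the helper name describe is my own invention, not part of unicodedata:

```python
import unicodedata

def describe(text):
    """Return (code point, general category, name) for each character.

    `describe` is a hypothetical helper for illustration only.
    """
    return [
        (
            f"U+{ord(ch):04X}",                   # code point in the usual notation
            unicodedata.category(ch),             # e.g. 'Lo' for letters, 'Mn' for nonspacing marks
            unicodedata.name(ch, "<unnamed>"),    # fall back instead of raising ValueError
        )
        for ch in text
    ]

word = 'कुरुक्षेत्र'
for code_point, category, name in describe(word):
    print(code_point, category, name)
```

The rows make the pattern visible: every base consonant is Lo, and everything that attaches to it (vowel signs, virama) is a mark.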
See if this is what you are trying to achieve:
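import unicodedata

def split_clusters(txt):
    """ Generate grapheme clusters for the Devanagari text."""

    stop = '्'      # DEVANAGARI SIGN VIRAMA (U+094D)
    cluster = ''
    end = None      # the previous character seen

    for char in txt:
        category = unicodedata.category(char)
        # A letter directly after a virama forms a conjunct with the
        # current cluster; any mark (Mn, Mc, Me) attaches to it as well.
        if (category == 'Lo' and end == stop) or category[0] == 'M':
            cluster = cluster + char
        else:
            # Anything else starts a new cluster; emit the finished one.
            if cluster:
                yield cluster
            cluster = char
        end = char

    # Emit the last cluster.
    if cluster:
        yield cluster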
Test the function:
>>> list(split_clusters('धर्मक्षेत्रे'))
['ध', 'र्म', 'क्षे', 'त्रे']
>>> list(split_clusters('कुरुक्षेत्र'))
['कु', 'रु', 'क्षे', 'त्र']
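One detail worth noting: the function checks category[0] == 'M' rather than comparing against 'Mn'. The example words happen to contain only nonspacing marks (Mn), but some Devanagari vowel signs are spacing marks (Mc). A small sketch to check this for yourself; mark_categories is a hypothetical helper name, and the sample word is written with escapes so the code points are unambiguous:

```python
import unicodedata

def mark_categories(text):
    """Map the name of each combining mark in text to its general category.

    `mark_categories` is a hypothetical helper for illustration only.
    """
    return {
        unicodedata.name(ch): unicodedata.category(ch)
        for ch in text
        if unicodedata.category(ch).startswith('M')   # Mn, Mc or Me
    }

# KA, VOWEL SIGN I, TA, VOWEL SIGN AA, BA, VOWEL SIGN E, ANUSVARA
sample = '\u0915\u093f\u0924\u093e\u092c\u0947\u0902'
for name, category in mark_categories(sample).items():
    print(category, name)
```

Here VOWEL SIGN I and VOWEL SIGN AA come out as Mc, so a test for 'Mn' alone would split them off from their consonants.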