I have something like this:
a = "बिक्रम मेरो नाम हो"
I want to achieve something like:
a[0] = बि
a[1] = क्र
a[3] = म
But since म takes 4 bytes while बि takes 8 bytes, I can't do this directly. What can be done to achieve this in Python?
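The algorithm for splitting text into grapheme clusters is given in Unicode Annex 29, section 3.1. I'm not going to implement the full algorithm for you here, but I'll show you roughly how to handle the Devanagari case, and then you can read the annex yourself and see what else you need to implement.
The unicodedata module contains the information you need to detect grapheme clusters.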
>>> import unicodedata
>>> a = "बिक्रम मेरो नाम हो"
>>> [unicodedata.name(c) for c in a]
['DEVANAGARI LETTER BA', 'DEVANAGARI VOWEL SIGN I', 'DEVANAGARI LETTER KA',
'DEVANAGARI SIGN VIRAMA', 'DEVANAGARI LETTER RA', 'DEVANAGARI LETTER MA',
'SPACE', 'DEVANAGARI LETTER MA', 'DEVANAGARI VOWEL SIGN E',
'DEVANAGARI LETTER RA', 'DEVANAGARI VOWEL SIGN O', 'SPACE',
'DEVANAGARI LETTER NA', 'DEVANAGARI VOWEL SIGN AA', 'DEVANAGARI LETTER MA',
'SPACE', 'DEVANAGARI LETTER HA', 'DEVANAGARI VOWEL SIGN O']
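In Devanagari, each grapheme cluster consists of an initial letter, optional pairs of virama (vowel killer) and letter, and an optional vowel sign. In regular expression notation that would be LETTER (VIRAMA LETTER)* VOWEL?. You can tell which is which by looking up the Unicode category of each code point: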
>>> [unicodedata.category(c) for c in a]
['Lo', 'Mc', 'Lo', 'Mn', 'Lo', 'Lo', 'Zs', 'Lo', 'Mn', 'Lo', 'Mc', 'Zs',
'Lo', 'Mc', 'Lo', 'Zs', 'Lo', 'Mc']
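Letters are category Lo (Letter, Other); vowel signs are category Mc (Mark, Spacing Combining) or, for some such as VOWEL SIGN E, Mn (Mark, Nonspacing); the virama is category Mn; and spaces are category Zs (Separator, Space).
So here's a rough way to split the grapheme clusters: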
def splitclusters(s):
    """Generate the grapheme clusters for the string s. (Not the full
    Unicode text segmentation algorithm, but probably good enough for
    Devanagari.)
    """
    virama = u'\N{DEVANAGARI SIGN VIRAMA}'
    cluster = u''
    last = None
    for c in s:
        cat = unicodedata.category(c)[0]
        # A mark always extends the cluster; a letter does so only
        # when it directly follows a virama.
        if cat == 'M' or (cat == 'L' and last == virama):
            cluster += c
        else:
            if cluster:
                yield cluster
            cluster = c
        last = c
    if cluster:
        yield cluster
>>> list(splitclusters(a))
['बि', 'क्र', 'म', ' ', 'मे', 'रो', ' ', 'ना', 'म', ' ', 'हो']
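The LETTER (VIRAMA LETTER)* VOWEL? pattern can also be applied fairly literally with the re module. The following is only a sketch under some assumptions: the helper names classify and split_by_pattern are made up for illustration, each code point is first reduced to a one-letter class (L, V, M, or O), and the pattern is loosened slightly to accept any run of marks after the letters. Unlike splitclusters, a mark that does not follow a letter is emitted on its own. It is not the full UAX #29 algorithm either.

```python
import re
import unicodedata

VIRAMA = "\N{DEVANAGARI SIGN VIRAMA}"

def classify(ch):
    """Map a code point to a class: V (virama), L (letter), M (mark), O (other)."""
    if ch == VIRAMA:
        return "V"
    major = unicodedata.category(ch)[0]
    return major if major in "LM" else "O"

# LETTER (VIRAMA LETTER)* followed by any marks or a trailing virama;
# anything that does not start with a letter becomes a one-element cluster.
CLUSTER = re.compile(r"L(?:VL)*[MV]*|.")

def split_by_pattern(s):
    """Split s into rough grapheme clusters by matching on the class string."""
    classes = "".join(classify(ch) for ch in s)
    # The class string has one character per code point, so match offsets
    # can be reused directly as offsets into s.
    return [s[m.start():m.end()] for m in CLUSTER.finditer(classes)]

if __name__ == "__main__":
    print(split_by_pattern("\u092c\u093f\u0915\u094d\u0930\u092e"))
```

Matching on a class string keeps the regular expression independent of any particular code point ranges, at the cost of one extra pass over the text.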
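So, you want to achieve something like this:
a[0] = बि a[1] = क्र a[3] = म
My advice is to ditch the idea that string indices correspond to the characters you see on the screen. Devanagari, like several other scripts, does not play well with programmers who grew up with Latin characters. I suggest reading chapter 9 of the Unicode Standard.
It looks like what you're trying to do is break a string into grapheme clusters. String indexing by itself will not let you do that. Hangul is another script that plays poorly with string indexing, and with combining characters, even something as familiar as Spanish will cause problems.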
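A small demonstration of the point, using only the standard library. The variable names are illustrative; the sketch simply shows that Python 3 string indices count code points, so one visible character can occupy more than one index.

```python
import unicodedata

# Spanish "ñ": one code point when precomposed, two after NFD decomposition.
composed = "\N{LATIN SMALL LETTER N WITH TILDE}"
decomposed = unicodedata.normalize("NFD", composed)
print(len(composed), len(decomposed))

# The Devanagari syllable from the question: one visible unit, two code points.
syllable = "\N{DEVANAGARI LETTER BA}\N{DEVANAGARI VOWEL SIGN I}"
print(len(syllable))
# Indexing returns the bare consonant, not the whole syllable.
print(unicodedata.name(syllable[0]))
```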
You will need an external library such as ICU to achieve this (unless you have a lot of free time). ICU has Python bindings.
>>> a = u"बिक्रम मेरो नाम हो"
>>> import icu
# Note: This next line took a lot of guesswork. The C, C++, and Java
# interfaces have better documentation.
>>> b = icu.BreakIterator.createCharacterInstance(icu.Locale())
>>> b.setText(a)
>>> i = 0
>>> for j in b:
...     s = a[i:j]
...     print('|', s, len(s))
...     i = j
...
| बि 2
| क् 2
| र 1
| म 1
| 1
| मे 2
| रो 2
| 1
| ना 2
| म 1
| 1
| हो 2
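Note that some of these "characters" (grapheme clusters) have length 2 and others have length 1. This is why string indexing is problematic: if I want to get grapheme cluster #69450 from a text file, I have to scan linearly through the whole file and count. So your options are: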
Indic and other non-Latin scripts such as Hangul do not generally let you match string indices to the characters you see. Working with Indic scripts is generally a pain. Each Devanagari code point takes two bytes in UTF-16 and three in UTF-8, and a single visible character may be made up of several code points. With Dravidian scripts there is no defined order. See the Unicode specification for more details.
That said, check here for some ideas about Unicode and Python with C++.
Finally, as Dietrich said, you might want to check out ICU too. It has bindings for C/C++ and Java via icu4c and icu4j respectively. There is a learning curve involved, so I suggest you set aside plenty of time for it. :)
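The byte counts that come up in this thread depend on the encoding, not on the script alone. As a rough illustration (byte_sizes is a hypothetical helper, not part of any library), this measures the same text under three encodings. The 4 and 8 bytes quoted in the question match UTF-32.

```python
def byte_sizes(text):
    """Return the encoded length of text in bytes for a few encodings."""
    # The "-le" variants are used so that no byte order mark is counted.
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}

ma = "\N{DEVANAGARI LETTER MA}"                                # one code point
bi = "\N{DEVANAGARI LETTER BA}\N{DEVANAGARI VOWEL SIGN I}"     # two code points

print(byte_sizes(ma))
print(byte_sizes(bi))
```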
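There is a pure-Python library called uniseg that provides a number of utilities, including a grapheme cluster iterator that gives the behavior you describe: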
>>> a = u"बिक्रम मेरो नाम हो"
>>> from uniseg.graphemecluster import grapheme_clusters
>>> for i in grapheme_clusters(a): print(i)
...
बि
क्
र
म
मे
रो
ना
म
हो
It claims to implement the full Unicode text segmentation algorithm described in http://www.unicode.org/reports/tr29/tr29-21.html.
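To get the a[0], a[1] style of access from the question without rescanning the text on every lookup, one possibility is to compute the cluster boundaries once and keep their offsets. The ClusterIndex class below is a hypothetical sketch, not part of uniseg or ICU. It accepts the clusters from any splitter (for example grapheme_clusters(a) above), and the demo passes a hand-written list so that it runs with the standard library alone.

```python
from itertools import accumulate

class ClusterIndex:
    """Random access to grapheme clusters through precomputed offsets."""

    def __init__(self, text, clusters):
        self.text = text
        # offsets[i] is the start of cluster i; the last entry is len(text).
        self.offsets = [0] + list(accumulate(len(c) for c in clusters))
        if self.offsets[-1] != len(text):
            raise ValueError("clusters do not cover the whole text")

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        if not isinstance(i, int):
            raise TypeError("cluster index must be an integer")
        if i < 0:
            i += len(self)
        if not 0 <= i < len(self):
            raise IndexError("cluster index out of range")
        return self.text[self.offsets[i]:self.offsets[i + 1]]

text = "\u092c\u093f\u0915\u094d\u0930\u092e"
# In practice the second argument would come from a real splitter.
index = ClusterIndex(text, ["\u092c\u093f", "\u0915\u094d\u0930", "\u092e"])
print(index[0], index[1], index[2])
```

Building the index still takes one linear pass, but each later lookup is a constant-time slice.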