
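I want to write a simple Python script that maps each Arabic letter to a phoneme symbol. I have a file containing a number of words, which the script reads and converts to phonemes, and my code contains the dictionary shown further down.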

The contents of my .txt file:

السلام عليكم
السلام عليكم و رحمة الله
السلام عليكم و رحمة الله و بركاته
الحمد لله
كيف حالك
كيف الحال

The dictionary in my code:

ar_let_phon_maplist = {u'ﺍ':'A:', u'ﺏ':'B', u'ﺕ':'T', u'ﺙ':'TH', u'ﺝ':'J', u'ﺡ':'H', u'ﺥ':'KH', u'ﻩ':'H', u'ﻉ':'(ayn) ’', u'ﻍ':'GH', u'ﻑ':'F', u'ﻕ':'q', u'ﺹ':u'ṣ', u'ﺽ':u'ḍ', u'ﺩ':'D', u'ﺫ':'DH', u'ﻁ':u'ṭ', u'ﻙ':'K', u'ﻡ':'M', u'ﻥ':'N', u'ﻝ':'L', u'ﻱ':'Y', u'ﺱ':'S', u'ﺵ':'SH', u'ﻅ':u'ẓ', u'ﺯ':'Z', u'ﻭ':'W', u'ﺭ':'R'}

I have a nested loop that reads each line and converts each character:

with codecs.open(sys.argv[1], 'r', encoding='utf-8') as file:
        lines = file.readlines()

line_counter = 0

for line in lines:
        print "Phonetics In Line " + str(line_counter)
        print line + " ",
        for word in line:
                for character in word:
                        if character == '\n':
                                print ""
                        elif character == ' ':
                                print "  "
                        else:
                                print ar_let_phon_maplist[character] + " ",
line_counter +=1

This is the error I get:

Phonetics In Line 0
السلام عليكم

Traceback (most recent call last):
  File "grapheme2phoneme.py", line 25, in <module>
    print ar_let_phon_maplist[character] + " ",
KeyError: u'\u0627'

I then used a Linux command to check whether the file is encoded as UTF-8:

file words.txt

The output I got:

words.txt: UTF-8 Unicode text

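Is there any solution to this problem? Why doesn't the character match the Unicode keys in the dictionary, given that the character I use as the key in ar_let_phon_maplist[character] is also Unicode? Is something wrong with my code?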


2 Answers


The first thing that catches the eye is the KeyError: your dictionary simply does not know about some of the symbols encountered in the file. Looking ahead, it does not know about ANY of the characters in the file, not only the first one.

What can we do about it? We could just add all of the symbols from the Arabic block of the Unicode table to the dictionary. Simple? Yes. Clear? No.

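If you want to actually understand the reason for this 'strange' behaviour, you need to know a bit more about Unicode. In short, many letters look alike but have different code points. Moreover, the same letter can sometimes be represented in multiple forms. Here the dictionary keys appear to be Arabic presentation forms (for example, the isolated alef, U+FE8D), while the file contains the ordinary letters (alef is U+0627, as the KeyError shows). So comparing Unicode characters is not a trivial task.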
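A minimal sketch of the mismatch, assuming the dictionary keys are Arabic presentation forms (the isolated alef U+FE8D is used here as an assumed example) while the file holds the ordinary letter U+0627 named in the KeyError. The variable names are illustrative only:

```python
import unicodedata

# Assumed example: the isolated presentation form of alef, which the
# dictionary keys appear to use.
isolated_alef = '\ufe8d'
# The ordinary alef that the KeyError reports as missing.
plain_alef = '\u0627'

print(unicodedata.name(isolated_alef))  # ARABIC LETTER ALEF ISOLATED FORM
print(unicodedata.name(plain_alef))     # ARABIC LETTER ALEF

# They render alike but are different code points, so a dict lookup fails.
print(isolated_alef == plain_alef)      # False

# Compatibility normalization maps the presentation form to the plain letter.
normalized = unicodedata.normalize('NFKD', isolated_alef)
print(normalized == plain_alef)         # True
```

Both strings display as the same letter, which is why the dictionary looks correct on screen even though none of its keys match the characters read from the file.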

So, if I were allowed to use Python 3.3+, I would solve the task as follows. First, I would normalize the keys of the ar_let_phon_maplist dictionary:

import unicodedata

ar_let_phon_maplist = {unicodedata.normalize('NFKD', k): v
                       for k, v in ar_let_phon_maplist.items()}

Then we iterate over the lines in the file, the words in each line, and the characters in each word, like this:

for index, line in enumerate(lines):
    print('Phonetics in line {0}, total {1} symbols'.format(index, len(line)))
    unknown = []  # Symbols not found in the dictionary are collected here
    words = line.split()
    for word in words:
        print(word, ': ', sep='', end='')
        for character in word:
            c = unicodedata.normalize('NFKD', character).casefold()
            try:                
                print(ar_let_phon_maplist[c], sep='', end='')
            except KeyError:
                print('_', sep='', end='')
                if c not in unknown:
                    unknown.append(c)
        print()
    if unknown:
        print('Unrecognized symbols: {0}, total {1} symbols'.format(', '.join(unknown), 
                                                                    len(unknown)))

The script will produce something like this:

Phonetics in line 4, total 9 symbols
كيف: KYF
حالك: HA:LK
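For reuse or testing, the same idea can be packaged as a function that returns a string instead of printing. This is only a sketch: `to_phonemes`, `raw_map`, and the `placeholder` parameter are hypothetical names, and the three-letter mapping is an assumed excerpt of the full dictionary, keyed by presentation forms as the original appears to be.

```python
import unicodedata

# Hypothetical three-letter excerpt of the mapping, keyed by isolated
# presentation forms (kaf U+FED9, yeh U+FEF1, feh U+FED1).
raw_map = {'\ufed9': 'K', '\ufef1': 'Y', '\ufed1': 'F'}

# Normalize the keys once so they match ordinary Arabic letters.
normalized_map = {unicodedata.normalize('NFKD', k): v
                  for k, v in raw_map.items()}


def to_phonemes(word, mapping, placeholder='_'):
    """Return the phoneme string for word; unknown characters become placeholder."""
    result = []
    for character in word:
        key = unicodedata.normalize('NFKD', character)
        result.append(mapping.get(key, placeholder))
    return ''.join(result)


word = '\u0643\u064a\u0641'  # the word "كيف" written with ordinary letters
print(to_phonemes(word, normalized_map))  # KYF
```

Using `dict.get` with a placeholder replaces the try/except from the loop above; collecting the unknown symbols would still need the explicit list.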
answered 2015-12-31T02:10:09.690
answered 2015-12-31T00:42:11.097