python - Tagging sentences with a custom dictionary

Question

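I am trying to tag sentences using a custom dictionary. For example, suppose I have two text files (1. my sentences, 2. my dictionary).

Sentence file:

I have abdominal pain and difficulty breathing

Dictionary file:

abdominal pain, difficulty breathing

I would like the output to look like this:

New file:

I have abdominal pain (AE) and difficulty breathing (AE)

How can I do this? Here is the code I have so far: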

import csv
from difflib import SequenceMatcher as SM
from nltk.util import ngrams

import codecs

with open('dictionary.csv','r') as csvFile:
    reader = csv.reader(csvFile)
    myfile = open("sentences.txt", "rt")
    my3file = open("tagged_sentences.txt", "w")
    hay = myfile.read()
    myfile.close()

phrases = []
for row in reader:
    needle = row[1]
    needle_length = len(needle.split())
    max_sim_val = 0.9
    max_sim_string = u""
    for ngram in ngrams(hay.split(), needle_length + int(.2 * needle_length)):
        hay_ngram = u" ".join(ngram)

        similarity = SM(None, hay_ngram, needle).ratio()
        if similarity > max_sim_val:
            max_sim_val = similarity
            max_sim_string = hay_ngram
            str = [row[1] , ' ', max_sim_val.__str__(),' ', max_sim_string , '\n']
            str1 = max_sim_string , row[2]
            phrases.append((max_sim_string, row[2]))

for line in hay.splitlines():
    if any(max_sim_string in line for max_sim_string, _ in phrases):
        for phrase in phrases:
            max_sim_string, _ = phrase
            if max_sim_string in line:
                tag_sent = line.replace(max_sim_string, phrase.__str__())
                my3file.writelines(tag_sent + '\n')
                print(tag_sent)
                break
    else:
        my3file.writelines(line + '\n')

csvFile.close()

The code above just creates an empty "tagged_sentences" file. Thanks.
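
A note on the likely cause, assuming the indentation in the post matches the real script: the `for row in reader:` loop sits outside the `with` block, so `dictionary.csv` is already closed when the reader is iterated. That raises `ValueError: I/O operation on closed file` before anything is written, which would leave the output file empty. Two further issues would show up after that: the output file is never closed, and `phrase.__str__()` inserts a tuple representation such as `('abdominal pain', 'AE')` rather than `abdominal pain (AE)`.

Below is a minimal sketch of one way to get the desired output. It is not a drop-in fix for the code above. The names `tag_sentence`, `tag_file`, and `term_tags` are hypothetical, and the sketch assumes the dictionary has already been loaded into a mapping from phrase to tag. It keeps the `SequenceMatcher` similarity idea and the 0.9 threshold from the question, but compares word windows of exactly the phrase length and does not use `nltk`.

```python
from difflib import SequenceMatcher


def tag_sentence(sentence, term_tags, threshold=0.9):
    """Return `sentence` with ' (TAG)' appended after each matched term.

    term_tags maps a dictionary phrase to its tag, e.g. {"abdominal pain": "AE"}.
    Matching is word-based and fuzzy; runs of whitespace collapse to one space.
    """
    words = sentence.split()
    # Longest phrases first, so a multi-word term wins over a shorter one.
    # Empty phrases are skipped because they would match everything.
    terms = sorted((t for t in term_tags if t.split()),
                   key=lambda t: len(t.split()), reverse=True)
    out, i = [], 0
    while i < len(words):
        for term in terms:
            n = len(term.split())
            if i + n > len(words):
                continue  # not enough words left for this phrase
            candidate = " ".join(words[i:i + n])
            score = SequenceMatcher(None, candidate.lower(), term.lower()).ratio()
            if score >= threshold:
                out.append(f"{candidate} ({term_tags[term]})")
                i += n
                break
        else:
            # No term matched at this position; keep the word unchanged.
            out.append(words[i])
            i += 1
    return " ".join(out)


def tag_file(sentences_path, output_path, term_tags):
    """Tag every line of one text file and write the result to another."""
    with open(sentences_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(tag_sentence(line, term_tags) + "\n")
```

With the example from the question, `tag_sentence("I have abdominal pain and difficulty breathing", {"abdominal pain": "AE", "difficulty breathing": "AE"})` returns `I have abdominal pain (AE) and difficulty breathing (AE)`. Because both files are opened in a `with` statement, they are flushed and closed even if an error occurs partway through.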
