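The task is to group expressions that are made up of multiple words (aka Multi-Word Expressions, or MWEs).
Given a dictionary of MWEs, I need to add hyphens to the input sentence wherever an MWE is detected, e.g.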
**Input:** i have got an ace of diamonds in my wet suit .
**Output:** i have got an ace-of-diamonds in my wet-suit .
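Currently I loop through the sorted dictionary, check whether each MWE appears in the sentence, and replace it wherever it does. But that wastes a lot of iterations, because most dictionary entries never occur in a given sentence.
Is there a better way? One solution is to generate all possible n-grams first, i.e. chunker2():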

    import codecs
    import re
    import time

    # Load the MWE dictionary, one expression per line.
    with codecs.open("wn-mwe-en.dic", "r", "utf8") as f:
        mwe_list = set(line.strip() for line in f)

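    def chunker(sentence):
        for item in mwe_list:
            # Dictionary entries may be written with spaces or with hyphens.
            if item in sentence or item.replace("-", " ") in sentence:
                mwe_item = "-".join(item.split(" "))
                # Match the expression whether its parts are joined by "-" or " ".
                r = re.compile(re.escape(mwe_item).replace("\\-", "[- ]"))
                sentence = re.sub(r, mwe_item, sentence)
        return sentence
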
    def chunker2(sentence):
        nodes = []
        tokens = sentence.split(" ")
        # Collect every contiguous n-gram, including those ending at the last token.
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens) + 1):
                nodes.append(" ".join(tokens[i:j]))
        # Keep only the n-grams with more than one token.
        n = set(node for node in nodes if len(node.split(" ")) > 1)
        intersect = mwe_list.intersection(n)
        for i in intersect:
            print(i)
            sentence = sentence.replace(i, i.replace(" ", "-"))
        return sentence

    s = "i have got an ace of diamonds in my wet suit ."

    # Time each approach separately (elapsed wall-clock seconds).
    start = time.time()
    print(chunker(s))
    print(time.time() - start)

    start = time.time()
    print(chunker2(s))
    print(time.time() - start)
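
For comparison, here is a minimal sketch of the n-gram idea that only looks at candidates up to the length of the longest dictionary entry, so each token position costs a few set lookups instead of a pass over the whole dictionary. The function name `hyphenate_mwes`, the `max_len` parameter, and the two-entry `toy_mwes` set are assumptions made up for illustration; they stand in for the real wn-mwe-en.dic and are not part of the code above. It is written for Python 3.

```python
def hyphenate_mwes(sentence, mwes, max_len=None):
    """Join known multi-word expressions in `sentence` with hyphens.

    `mwes` is a set of space-separated expressions (toy data in this sketch).
    """
    tokens = sentence.split(" ")
    if max_len is None:
        # Longest expression in the dictionary, measured in tokens.
        max_len = max((len(m.split(" ")) for m in mwes), default=0)
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so longer expressions win.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in mwes:
                out.append(candidate.replace(" ", "-"))
                i += n
                break
        else:
            # No expression starts at this token; keep it unchanged.
            out.append(tokens[i])
            i += 1
    return " ".join(out)


# Hypothetical two-entry dictionary, just for the example sentence.
toy_mwes = {"ace of diamonds", "wet suit"}
print(hyphenate_mwes("i have got an ace of diamonds in my wet suit .", toy_mwes))
# i have got an ace-of-diamonds in my wet-suit .
```

Because this sketch compares whole tokens, an entry such as "ace of" would not match inside "face of", which plain substring replacement would do. It assumes dictionary entries are stored with spaces between words; entries stored with hyphens would need normalizing first.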