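While using WordNet for text matching, I realized that WordNet can only match a single word against a single word. It cannot match a single word against a phrase.
As you can see, I have two lists: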
list1=['fruit', 'world']
list2=[u'domain', u'creation Year', u'world Tournament Silver', u'relation', u'existence', u'id', u'publication',
u'third Commander', u'management Region', u'ra', u'Earthquake', u'final Publication Year', u'creation Christian Bishop',
u'Planet', u'management Position', u'Race', u'world', u'first Publication Year', u'main Domain',
u'golden Globe Award', u'ist', u'race', u'world Tournament Bronze', u'top Level Domain', u'lower Earth Orbit Payload']
list2 contains both single words and phrases, for example 'relation' and 'management Position'.
Currently I use WordNet to find the similarity:
from nltk.corpus import wordnet

for word1 in list1:
    # print word1
    for word2 in list2:
        # print word2
        # synsets() returns an empty list when WordNet has no entry for the string
        wordFromList1 = wordnet.synsets(word1)
        wordFromList2 = wordnet.synsets(word2)
        if wordFromList1 and wordFromList2:
            # Wu-Palmer similarity between the first synset of each word
            s = wordFromList1[0].wup_similarity(wordFromList2[0])
            w1 = wordFromList1[0].lemmas()[0].name()
            w2 = wordFromList2[0].lemmas()[0].name()
            similarity = (s, w1, w2)
            print(similarity)
Result:
(0.125, u'fruit', u'sphere')
(0.16666666666666666, u'fruit', u'relation')
(0.14285714285714285, u'fruit', u'being')
(0.3157894736842105, u'fruit', u'Idaho')
(0.4444444444444444, u'fruit', u'publication')
(0.25, u'fruit', u'radium')
(0.25, u'fruit', u'earthquake')
(0.625, u'fruit', u'planet')
(0.125, u'fruit', u'race')
(0.6666666666666666, u'fruit', u'universe')
(0.125, u'fruit', u'race')
(0.15384615384615385, u'universe', u'sphere')
(0.2222222222222222, u'universe', u'relation')
(0.18181818181818182, u'universe', u'being')
(0.375, u'universe', u'Idaho')
(0.5333333333333333, u'universe', u'publication')
(0.3076923076923077, u'universe', u'radium')
(0.3076923076923077, u'universe', u'earthquake')
(0.7692307692307693, u'universe', u'planet')
(0.15384615384615385, u'universe', u'race')
(1.0, u'universe', u'universe')
(0.15384615384615385, u'universe', u'race')
The problem is that WordNet only compares single words. It does not compute the similarity between a single word and the phrases in list2, such as:
'world' VS 'world Tournament Silver'
'world' VS 'world Tournament Bronze'
'world' VS 'creation Year'
.......................
So how can I solve this problem?
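
One direction that might work, as a rough sketch only: split each phrase into its individual words, score every word against the single word with the same first-synset Wu-Palmer comparison used above, and keep the best score. The helper names `wup` and `phrase_similarity` and the `word_sim` parameter are hypothetical, introduced just for this illustration. Taking the maximum is also an assumption; an average over the words would be another reasonable choice.

```python
def wup(word1, word2):
    """Wu-Palmer similarity of the first synsets of two single words, or None."""
    # Imported lazily so the phrase logic below can be tried without the corpus.
    from nltk.corpus import wordnet
    synsets1 = wordnet.synsets(word1)
    synsets2 = wordnet.synsets(word2)
    if not synsets1 or not synsets2:
        return None
    # May itself be None when the two synsets share no common ancestor.
    return synsets1[0].wup_similarity(synsets2[0])


def phrase_similarity(word, phrase, word_sim=wup):
    """Best word-level score between `word` and any word of `phrase`.

    `word_sim` is any callable taking two words and returning a score or None.
    Returns None when no word of the phrase could be scored.
    """
    scores = [word_sim(word, token) for token in phrase.lower().split()]
    scores = [s for s in scores if s is not None]
    return max(scores) if scores else None
```

With this, the inner loop could call `phrase_similarity(word1, word2)` instead of looking up `word2` directly, so that entries such as u'world Tournament Silver' are scored through their individual words rather than skipped. It still ignores word order and treats every word of the phrase as equally important, so it is only an approximation of phrase-level meaning.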