I'm trying to compare terms/expressions that may (or may not) be semantically related. These are not full sentences, and not necessarily single words; for example,
"social networking service" and "social network" are obviously closely related, but how can I quantify that with nltk?
Apparently I'm missing some code, since
w1 = wordnet.synsets('social network')
returns an empty list.
Any suggestions on how to approach this?
There are measures of semantic relatedness and similarity, but as far as I know they are defined for single words, or for single expressions that are lexical entries in WordNet, rather than for compounds of WordNet entries.
Here is a nice web implementation of many WordNet-based similarity measures.
If you are interested, here is some further reading on interpreting compounds with WordNet similarity (although it does not evaluate the similarity of the compounds themselves):
Here is a solution you can use.
w1 = wordnet.synsets('social')
w2 = wordnet.synsets('network')
Both w1 and w2 hold a list of synsets. Compute the similarity between every synset in w1 and every synset in w2; the pair with the maximum similarity gives you the combined synset (which is what you are looking for).
Here is the full code:
from nltk.corpus import wordnet
x = 'social'
y = 'network'
xsyn = wordnet.synsets(x)
# xsyn
#[Synset('sociable.n.01'), Synset('social.a.01'), Synset('social.a.02'),
#Synset('social.a.03'), Synset('social.s.04'), Synset('social.s.05'),
#Synset('social.s.06')]
ysyn = wordnet.synsets(y)
#ysyn
#[Synset('network.n.01'), Synset('network.n.02'), Synset('net.n.06'),
#Synset('network.n.04'), Synset('network.n.05'), Synset('network.v.01')]
xlen = len(xsyn)
ylen = len(ysyn)
import numpy
simindex = numpy.zeros((xlen, ylen))

def relative_matrix(asyn, bsyn, simindex):
    """Fill simindex with the Wu-Palmer similarity of every synset pair."""
    for I, cb in enumerate(asyn):
        for J, ib in enumerate(bsyn):
            if cb.pos() != ib.pos():  # compare n-n, v-v; skip n-v or a-n
                continue
            score = cb.wup_similarity(ib)  # may be None for unconnected pairs
            if score is not None and simindex[I, J] < score:
                simindex[I, J] = score

relative_matrix(xsyn, ysyn, simindex)
print(simindex)
'''
array([[ 0.46153846, 0.125 , 0.13333333, 0.125 , 0.125 ,
0. ],
[ 0. , 0. , 0. , 0. , 0. ,
0. ],
[ 0. , 0. , 0. , 0. , 0. ,
0. ],
[ 0. , 0. , 0. , 0. , 0. ,
0. ],
[ 0. , 0. , 0. , 0. , 0. ,
0. ],
[ 0. , 0. , 0. , 0. , 0. ,
0. ],
[ 0. , 0. , 0. , 0. , 0. ,
0. ]])
'''
#xsyn[0].definition()
#'a party of people assembled to promote sociability and communal activity'
#ysyn[0].definition()
#'an interconnected system of things or people'
If you look at simindex, you will see that simindex[0,0] is the maximum, 0.46153846, so xsyn[0] and ysyn[0] seem to be the best description of what w1 = wordnet.synsets('social network') would be,
as you can see from their definitions.
import difflib
sm = difflib.SequenceMatcher(None)
sm.set_seq2('Social network')
#SequenceMatcher computes and caches detailed information
#about the second sequence, so if you want to compare one
#sequence against many sequences, use set_seq2() to set
#the commonly used sequence once and call set_seq1()
#repeatedly, once for each of the other sequences.
# (the doc)
for x in ('Social networking service',
          'Social working service',
          'Social ocean',
          'Atlantic ocean',
          'Atlantic and arctic oceans'):
    sm.set_seq1(x)
    print(x, sm.ratio())
The results:
Social networking service 0.717948717949
Social working service 0.611111111111
Social ocean 0.615384615385
Atlantic ocean 0.214285714286
Atlantic and arctic oceans 0.15
https://www.mashape.com/amtera/esa-semantic-relatedness
This is a Web API that computes the semantic relatedness between pairs of words or text excerpts.
You might need a WSD (word sense disambiguation) module that returns a WordNet Synset
object from NLTK. If so, you can take a look at this: https://github.com/alvations/pywsd
$ wget https://github.com/alvations/pywsd/archive/master.zip
$ unzip master.zip
$ cd pywsd/
$ ls
baseline.py cosine.py lesk.py README.md similarity.py test_wsd.py
$ python
>>> from similarity import max_similarity
>>> sent = 'I went to the bank to deposit my money'
>>> sim_choice = "lin" # Using Lin's (1998) similarity measure.
>>> print("Context:", sent)
>>> print("Similarity:", sim_choice)
>>> answer = max_similarity(sent, 'bank', sim_choice)
>>> print("Sense:", answer)
>>> print("Definition:", answer.definition())
[out]:
Context: I went to the bank to deposit my money
Similarity: lin
Sense: Synset('bank.n.09')
Definition: a building in which the business of banking transacted