I'm trying to compare terms/expressions that may (or may not) be semantically related. These are not full sentences, and not necessarily single words; for example,
"social networking service" and "social network" are obviously closely related, but how can I quantify that with nltk?
Apparently I'm missing some code, since
w1 = wordnet.synsets('social network')
returns an empty list.
Any suggestions on how to approach this?
There are measures of semantic relatedness and similarity, but as far as I know they are defined for single words, or for single expressions that are lexical entries in WordNet, rather than for compounds of WordNet entries.
Here is a nice web implementation of many WordNet-based similarity measures.
If you are interested, here is some further reading on interpreting compounds with WordNet similarity (although it does not evaluate the similarity of the compounds themselves):
Here is a solution you can use.
w1 = wordnet.synsets('social')
w2 = wordnet.synsets('network')
Both w1 and w2 hold a list of synsets. Compute the similarity between every synset in w1 and every synset in w2; the pair with the maximum similarity gives you the combined synset (which is what you are looking for).
Here is the full code:
from nltk.corpus import wordnet
x = 'social'
y = 'network'
xsyn = wordnet.synsets(x)
# xsyn
#[Synset('sociable.n.01'), Synset('social.a.01'), Synset('social.a.02'),
#Synset('social.a.03'), Synset('social.s.04'), Synset('social.s.05'),
#Synset('social.s.06')]
ysyn = wordnet.synsets(y)
#ysyn
#[Synset('network.n.01'), Synset('network.n.02'), Synset('net.n.06'),
#Synset('network.n.04'), Synset('network.n.05'), Synset('network.v.01')]
xlen = len(xsyn)
ylen = len(ysyn)
import numpy
simindex = numpy.zeros((xlen, ylen))

def relative_matrix(asyn, bsyn, simindex):
    """Fill simindex with the Wu-Palmer similarity of every synset pair."""
    for I, cb in enumerate(asyn):
        for J, ib in enumerate(bsyn):
            if cb.pos() != ib.pos():  # compare n-n, v-v; skip n-v or a-n
                continue
            score = cb.wup_similarity(ib)  # may be None for unconnected pairs
            if score is not None and simindex[I, J] < score:
                simindex[I, J] = score

relative_matrix(xsyn, ysyn, simindex)
print(simindex)
'''
array([[ 0.46153846, 0.125 , 0.13333333, 0.125 , 0.125 ,
0. ],
[ 0. , 0. , 0. , 0. , 0. ,
0. ],
[ 0. , 0. , 0. , 0. , 0. ,
0. ],
[ 0. , 0. , 0. , 0. , 0. ,
0. ],
[ 0. , 0. , 0. , 0. , 0. ,
0. ],
[ 0. , 0. , 0. , 0. , 0. ,
0. ],
[ 0. , 0. , 0. , 0. , 0. ,
0. ]])
'''
#xsyn[0].definition()
#'a party of people assembled to promote sociability and communal activity'
#ysyn[0].definition()
#'an interconnected system of things or people'
If you look at simindex, you will see that simindex[0,0] is the maximum, 0.46153846, so xsyn[0] and ysyn[0] seem to be the best description of what w1 = wordnet.synsets('social network') would be,
as you can see from their definitions.
import difflib
sm = difflib.SequenceMatcher(None)
sm.set_seq2('Social network')
#SequenceMatcher computes and caches detailed information
#about the second sequence, so if you want to compare one
#sequence against many sequences, use set_seq2() to set
#the commonly used sequence once and call set_seq1()
#repeatedly, once for each of the other sequences.
# (the doc)
for x in ('Social networking service',
          'Social working service',
          'Social ocean',
          'Atlantic ocean',
          'Atlantic and arctic oceans'):
    sm.set_seq1(x)
    print(x, sm.ratio())
The results:
Social networking service 0.717948717949
Social working service 0.611111111111
Social ocean 0.615384615385
Atlantic ocean 0.214285714286
Atlantic and arctic oceans 0.15
https://www.mashape.com/amtera/esa-semantic-relatedness
This is a Web API that computes the semantic relatedness between pairs of words or text excerpts.
You might need a WSD (word sense disambiguation) module that returns a WordNet Synset
object from NLTK. If so, you can take a look at this: https://github.com/alvations/pywsd
$ wget https://github.com/alvations/pywsd/archive/master.zip
$ unzip master.zip
$ cd pywsd/
$ ls
baseline.py cosine.py lesk.py README.md similarity.py test_wsd.py
$ python
>>> from similarity import max_similarity
>>> sent = 'I went to the bank to deposit my money'
>>> sim_choice = "lin" # Using Lin's (1998) similarity measure.
>>> print("Context:", sent)
>>> print("Similarity:", sim_choice)
>>> answer = max_similarity(sent, 'bank', sim_choice)
>>> print("Sense:", answer)
>>> print("Definition:", answer.definition())
[out]:
Context: I went to the bank to deposit my money
Similarity: lin
Sense: Synset('bank.n.09')
Definition: a building in which the business of banking transacted