I am using the Lesk algorithm to get SynSets from a text, but I get different results for the same input. Is this a "feature" of the Lesk algorithm, or am I doing something wrong? Below is the code I am using:
self.SynSets = []
sentences = sent_tokenize(
    "Python is a widely used general-purpose, high-level programming language. "
    "Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. "
    "The language provides constructs intended to enable clear programs on both a small and large scale. "
    "Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles.")
stopwordsList = stopwords.words('english')
self.sentNum = 0
for sentence in sentences:
    raw_tokens = word_tokenize(sentence)
    # removing stopwords and digit-only tokens
    final_tokens = [token.lower() for token in raw_tokens
                    if token not in stopwordsList
                    #and (len(token) > 3)
                    and not token.isdigit()]
    for token in final_tokens:
        synset = wsd.lesk(sentence, token)
        if synset is not None:
            self.SynSets.append(synset)
self.SynSets = set(self.SynSets)
self.WriteSynSets()
return self
In the output I get these results (the first 3 results, from 2 different runs):
Synset('allow.v.09') Synset('code.n.03') Synset('coffee.n.01')
------------
Synset('allow.v.09') Synset('argumentation.n.02') Synset('boastfully.r.01')
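My current guess (an assumption, not something I have verified in the NLTK sources) is that lesk breaks ties between equally scored senses via a set, whose iteration order Python 3 randomizes per process. The sketch below runs the same disambiguation in fresh interpreters with different PYTHONHASHSEED values; if the guess is right, the printed synsets can differ between seeds:

    import os
    import subprocess
    import sys

    # Assumption: lesk's tie-breaking goes through a set, and Python 3
    # randomizes set/hash ordering per process (PYTHONHASHSEED).
    # Each child below runs the same disambiguation under a different seed;
    # if the assumption holds, the output can differ between seeds.
    snippet = ("from nltk import wsd; "
               "print(wsd.lesk('Python is a high-level programming language'"
               ".split(), 'language'))")
    for seed in ("1", "2", "3"):
        env = dict(os.environ, PYTHONHASHSEED=seed)
        output = subprocess.check_output([sys.executable, "-c", snippet], env=env)
        print("seed", seed, "->", output.decode().strip())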
If there is another (more stable) way to get the synsets, I would appreciate your help.
Thanks in advance.
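One workaround I have been trying (deterministic_lesk is my own helper, not an NLTK API) is a simplified Lesk where ties in gloss overlap are broken by synset name instead of by set iteration order, so the same input always maps to the same synset:

    from nltk.corpus import wordnet as wn

    def deterministic_lesk(context_tokens, word):
        """Simplified Lesk with a reproducible tie-break (my own sketch)."""
        context = set(token.lower() for token in context_tokens)
        candidates = wn.synsets(word)
        if not candidates:
            return None
        # Score each sense by gloss overlap with the context; ties are broken
        # by synset name, which is stable, unlike set iteration order.
        return max(candidates,
                   key=lambda ss: (len(context & set(ss.definition().lower().split())),
                                   ss.name()))

In the loop above it would be called as deterministic_lesk(word_tokenize(sentence), token).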
EDITED
For another example, here is the complete script that I ran twice:
import nltk
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize
from nltk import wsd
from nltk.corpus import stopwords

SynSets = []
sentences = sent_tokenize(
    "Python is a widely used general-purpose, high-level programming language. "
    "Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. "
    "The language provides constructs intended to enable clear programs on both a small and large scale. "
    "Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles.")
stopwordsList = stopwords.words('english')

for sentence in sentences:
    raw_tokens = word_tokenize(sentence)  # WordPunctTokenizer().tokenize(sentence)
    # removing stopwords and digit-only tokens (the length filter is commented out)
    final_tokens = [token.lower() for token in raw_tokens
                    if token not in stopwordsList
                    #and (len(token) > 3)
                    and not token.isdigit()]
    for token in final_tokens:
        synset = wsd.lesk(sentence, token)
        if synset is not None:
            SynSets.append(synset)

SynSets = sorted(set(SynSets))
with open("synsets.txt", "a") as file:
    file.write("\n-------------------\n")
    for synset in SynSets:
        file.write("{} ".format(synset))
I got these results (the first 4 synsets written to the file on each of the 2 runs of the program):
Synset('allow.v.04') Synset('boastfully.r.01') Synset('clear.v.11') Synset('code.n.02')
Synset('boastfully.r.01') Synset('clear.v.19') Synset('code.n.01') Synset('design.n.04')
SOLUTION to my problem: after reinstalling Python 2.7, all the problems went away. So, don't use Python 3.x with the Lesk algorithm.
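If anyone else hits this and cannot downgrade: assuming the cause really is Python 3's randomized hash/set ordering (my assumption, not verified), pinning PYTHONHASHSEED before the interpreter starts should make runs reproducible without leaving Python 3. An untested sketch:

    import os
    import sys

    # Assumption: the run-to-run variation comes from Python 3's per-process
    # hash randomization. Re-exec the script once with PYTHONHASHSEED pinned so
    # that set iteration order, and with it lesk's tie-breaking, is the same on
    # every run.
    if os.environ.get("PYTHONHASHSEED") != "0":
        os.environ["PYTHONHASHSEED"] = "0"
        os.execv(sys.executable, [sys.executable] + sys.argv)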