I am using the Lesk algorithm to get SynSets from a text, but I get different results for the same input. Is this a "feature" of the Lesk algorithm, or am I doing something wrong? Below is the code I am using:
self.SynSets = []
sentences = sent_tokenize(
    "Python is a widely used general-purpose, high-level programming language. "
    "Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. "
    "The language provides constructs intended to enable clear programs on both a small and large scale. "
    "Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles.")
stopwordsList = stopwords.words('english')
self.sentNum = 0
for sentence in sentences:
    raw_tokens = word_tokenize(sentence)
    # removing stopwords and digit-only tokens
    final_tokens = [token.lower() for token in raw_tokens
                    if token not in stopwordsList
                    #and (len(token) > 3)
                    and not token.isdigit()]
    for token in final_tokens:
        synset = wsd.lesk(sentence, token)
        if synset is not None:
            self.SynSets.append(synset)
self.SynSets = set(self.SynSets)
self.WriteSynSets()
return self
In the output I get these results (the first 3 results, from 2 different runs):
Synset('allow.v.09') Synset('code.n.03') Synset('coffee.n.01')
------------
Synset('allow.v.09') Synset('argumentation.n.02') Synset('boastfully.r.01')
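My current guess (an assumption, not something I have verified in the NLTK sources) is that lesk breaks ties between equally scored senses via a set, whose iteration order Python 3 randomizes per process. The sketch below runs the same disambiguation in fresh interpreters with different PYTHONHASHSEED values; if the guess is right, the printed synsets can differ between seeds:

    import os
    import subprocess
    import sys

    # Assumption: lesk's tie-breaking goes through a set, and Python 3
    # randomizes set/hash ordering per process (PYTHONHASHSEED).
    # Each child below runs the same disambiguation under a different seed;
    # if the assumption holds, the output can differ between seeds.
    snippet = ("from nltk import wsd; "
               "print(wsd.lesk('Python is a high-level programming language'"
               ".split(), 'language'))")
    for seed in ("1", "2", "3"):
        env = dict(os.environ, PYTHONHASHSEED=seed)
        output = subprocess.check_output([sys.executable, "-c", snippet], env=env)
        print("seed", seed, "->", output.decode().strip())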
If there is another (more stable) way to get the synsets, I would appreciate your help.
Thanks in advance.
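One workaround I have been trying (deterministic_lesk is my own helper, not an NLTK API) is a simplified Lesk where ties in gloss overlap are broken by synset name instead of by set iteration order, so the same input always maps to the same synset:

    from nltk.corpus import wordnet as wn

    def deterministic_lesk(context_tokens, word):
        """Simplified Lesk with a reproducible tie-break (my own sketch)."""
        context = set(token.lower() for token in context_tokens)
        candidates = wn.synsets(word)
        if not candidates:
            return None
        # Score each sense by gloss overlap with the context; ties are broken
        # by synset name, which is stable, unlike set iteration order.
        return max(candidates,
                   key=lambda ss: (len(context & set(ss.definition().lower().split())),
                                   ss.name()))

In the loop above it would be called as deterministic_lesk(word_tokenize(sentence), token).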
EDITED
For another example, here is the complete script that I ran twice:
import nltk
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize
from nltk import wsd
from nltk.corpus import stopwords

SynSets = []
sentences = sent_tokenize(
    "Python is a widely used general-purpose, high-level programming language. "
    "Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. "
    "The language provides constructs intended to enable clear programs on both a small and large scale. "
    "Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles.")
stopwordsList = stopwords.words('english')

for sentence in sentences:
    raw_tokens = word_tokenize(sentence)  # WordPunctTokenizer().tokenize(sentence)
    # removing stopwords and digit-only tokens (the length filter is commented out)
    final_tokens = [token.lower() for token in raw_tokens
                    if token not in stopwordsList
                    #and (len(token) > 3)
                    and not token.isdigit()]
    for token in final_tokens:
        synset = wsd.lesk(sentence, token)
        if synset is not None:
            SynSets.append(synset)

SynSets = sorted(set(SynSets))
with open("synsets.txt", "a") as file:
    file.write("\n-------------------\n")
    for synset in SynSets:
        file.write("{} ".format(synset))
I got these results (the first 4 synsets written to the file on each of the 2 runs of the program):
Synset('allow.v.04') Synset('boastfully.r.01') Synset('clear.v.11') Synset('code.n.02')
Synset('boastfully.r.01') Synset('clear.v.19') Synset('code.n.01') Synset('design.n.04')
SOLUTION to my problem: after reinstalling Python 2.7, all the problems went away. So, don't use Python 3.x with the Lesk algorithm.
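If anyone else hits this and cannot downgrade: assuming the cause really is Python 3's randomized hash/set ordering (my assumption, not verified), pinning PYTHONHASHSEED before the interpreter starts should make runs reproducible without leaving Python 3. An untested sketch:

    import os
    import sys

    # Assumption: the run-to-run variation comes from Python 3's per-process
    # hash randomization. Re-exec the script once with PYTHONHASHSEED pinned so
    # that set iteration order, and with it lesk's tie-breaking, is the same on
    # every run.
    if os.environ.get("PYTHONHASHSEED") != "0":
        os.environ["PYTHONHASHSEED"] = "0"
        os.execv(sys.executable, [sys.executable] + sys.argv)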