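I only need to extract the words whose tags match my program's pos_tags variable and pass those words to the LSI model, but when I print nouns, I get an empty list.
Here is sample input from my noun file: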
['All,DT', 'praise,NN', 'is,VBZ', 'due,JJ', 'to,TO', 'God,NNP', 'alone,RB', ',,,', 'the,DT', 'Sustainer,NNP', 'of,IN', 'all,PDT', 'the,DT', 'worlds,NNS', ',,,', '\n']
['The,DT', 'Most,JJS', 'Gracious,JJ', ',,,', 'the,DT', 'Dispenser,NNP', 'of,IN', 'Grace,NNP', ',,,', '\n']
['Lord,NNP', 'of,IN', 'the,DT', 'Day,NNP', 'of,IN', 'Judgment,NN', '!,.', '\n']
['Thee,NNP', 'alone,RB', 'do,VBP', 'we,PRP', 'worship,NN', ';,:', 'and,CC', 'unto,JJ', 'Thee,NNP', 'alone,RB', 'do,VBP', 'we,PRP', 'turn,VB', 'for,IN', 'aid,NN', '.,.', '\n']
['Guide,NNP', 'us,PRP', 'the,DT', 'straight,JJ', 'way,NN', '.,.', '\n']
Here is my sample code:
import nltk
import os.path
import re
import gensim
from gensim import corpora, models, similarities
thefile='E://noun.txt'
file3 = open(thefile,'r',encoding='utf8')
nouns=[]
arr=[]
arr1=[]
pos_tags = ('NN','NNS', 'NNPS')
for line in file3.readlines():
    nouns.append([j.split(',')[0] for i in line for j in i if any(j.endswith(p) for p in pos_tags)])
print(nouns)
dictionary = corpora.Dictionary(nouns)
corpus = [dictionary.doc2bow(inp) for inp in nouns]
# extract 10 LSI topics; use the default one-pass algorithm
lsi = models.lsimodel.LsiModel(corpus=corpus, id2word=dictionary, num_topics=10)
# print the most contributing words (both positively and negatively) for each of the first ten topics
arr1.append(lsi.print_topics(10))
print(arr1)
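A likely cause of the empty list: each `line` returned by `readlines()` is a string, so `for i in line for j in i` walks over single characters, and a single character can never end with a tag such as `NN`. The sketch below is one possible way to do the extraction, assuming each line of the file is stored as the text of a Python list, exactly as in the sample input. The helper names `parse_line` and `extract_words` are hypothetical, introduced only for illustration.

```python
import ast

# Tags to keep, copied from the pos_tags variable in the question.
POS_TAGS = ('NN', 'NNS', 'NNPS')

def parse_line(line):
    """Turn one line of text such as "['way,NN', '.,.']" into a list of tokens.

    Assumption: the file stores each line as a Python list literal.
    """
    line = line.strip()
    if not line:
        return []
    return ast.literal_eval(line)

def extract_words(tokens, pos_tags=POS_TAGS):
    """Return the words whose tag is exactly one of pos_tags."""
    words = []
    for token in tokens:
        # Split on the last comma, so the tag is whatever follows it.
        word, sep, tag = token.rpartition(',')
        # Tokens without a comma (for example '\n') are skipped.
        if sep and tag in pos_tags:
            words.append(word)
    return words

# Two lines from the sample input, written as they would appear in the file.
sample_lines = [
    "['Lord,NNP', 'of,IN', 'the,DT', 'Day,NNP', 'of,IN', 'Judgment,NN', '!,.', '\\n']",
    "['Guide,NNP', 'us,PRP', 'the,DT', 'straight,JJ', 'way,NN', '.,.', '\\n']",
]

nouns = [extract_words(parse_line(line)) for line in sample_lines]
print(nouns)  # [['Judgment'], ['way']]
```

The resulting `nouns` is a list of word lists, which is the shape `corpora.Dictionary(nouns)` expects. Note that `NNP` is not in `pos_tags`, so proper nouns such as `God` or `Lord` are dropped; add `'NNP'` to the tuple if they should be kept.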