我已经使用 gensim 为 LDA 主题建模训练了一个语料库。
浏览 gensim 网站上的教程(这不是全部代码):
question = 'Changelog generation from Github issues?';
temp = question.lower()
for i in range(len(punctuation_string)):
temp = temp.replace(punctuation_string[i], '')
words = re.findall(r'\w+', temp, flags = re.UNICODE | re.LOCALE)
important_words = []
important_words = filter(lambda x: x not in stoplist, words)
print important_words
dictionary = corpora.Dictionary.load('questions.dict')
ques_vec = []
ques_vec = dictionary.doc2bow(important_words)
print dictionary
print ques_vec
print lda[ques_vec]
这是我得到的输出:
['changelog', 'generation', 'github', 'issues']
Dictionary(15791 unique tokens)
[(514, 1), (3625, 1), (3626, 1), (3627, 1)]
[(4, 0.20400000000000032), (11, 0.20400000000000032), (19, 0.20263215848547525), (29, 0.20536784151452539)]
我不知道最后的输出将如何帮助我找到可能的主题question
!!!
请帮忙!