I built an NMF topic model in Python. Here is the relevant code:
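from sklearn import decomposition
from sklearn.feature_extraction.text import TfidfVectorizer

def select_vectorizer(req_ngram_range=(1, 2)):
    # TfidfVectorizer expects ngram_range as a (min_n, max_n) tuple
    ngram_lengths = tuple(req_ngram_range)
    vectorizer = TfidfVectorizer(analyzer='word', ngram_range=ngram_lengths, stop_words='english', min_df=2)
    #print("User specified custom stopwords: {} ...".format(str(custom_stopwords)[1:-1]))
    return vectorizer

vectorizer = select_vectorizer([2, 5])
# new_review_list is a list of review strings
X = vectorizer.fit_transform(new_review_list)
# Note: newer scikit-learn releases replace `alpha` with alpha_W / alpha_H
# and get_feature_names() with get_feature_names_out()
clf = decomposition.NMF(n_components=20, random_state=3, alpha=.1).fit(X)
vocab = vectorizer.get_feature_names()
print_top_words(clf, vocab, num_top_words)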
It produces 20 topics, like these:
Topic #0:
[u'blocks available', u'delivery blocks available', u'notifications blocks', u'notifications blocks available', u'new blocks', u'know blocks available', u'new blocks available', u'know blocks', u'open blocks available', u'available work', u'zero blocks', u'like blocks', u'notification blocks', u'day blocks', u'slow blocks', u'10 blocks', u'option set', u'logged 10', u'notification blocks available', u'day blocks available']
Topic #1:
[u'amazon flex', u'working amazon', u'amazon flex app', u'working amazon flex', u'hello amazon', u'hello amazon flex', u'flex delivery', u'amazon flex delivery', u'flex team', u'amazon flex team', u'work amazon', u'amazon flex support', u'flex support', u'work amazon flex', u'deliver amazon', u'hi amazon flex', u'hi amazon', u'deliver amazon flex', u'signed amazon', u'love amazon']
...and so on.
Now I want to test the model on new text, so that each new text gets classified into one of these topics. How do I do that?
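For illustration, here is a minimal sketch of one common way to do this, not necessarily the only one: reuse the already fitted vectorizer and NMF model, call `transform` on the new texts, and take the topic with the largest weight. The toy corpus, the `assign_topics` helper, and the settings `n_components=2`, `ngram_range=(1, 2)`, `init="nndsvda"` and `max_iter=500` are assumptions made so the example runs on its own; the regularization argument is left out because its name differs between scikit-learn versions.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for new_review_list (hypothetical data).
train_docs = [
    "no delivery blocks available today",
    "new blocks available notifications are slow",
    "blocks available only late at night",
    "working for amazon flex is great",
    "amazon flex app support team helped",
    "love working with amazon flex delivery",
]

# Fit the vectorizer and the topic model once, on the training corpus.
vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                             stop_words="english", min_df=2)
X = vectorizer.fit_transform(train_docs)
clf = NMF(n_components=2, random_state=3, init="nndsvda", max_iter=500).fit(X)


def assign_topics(texts, vectorizer, model):
    """Return (dominant topic index per text, full topic-weight matrix)."""
    # Reuse the fitted vocabulary and IDF weights: transform, do not refit.
    X_new = vectorizer.transform(texts)
    # Project the new documents onto the learned topics;
    # the result has shape (n_texts, n_components).
    W_new = model.transform(X_new)
    return W_new.argmax(axis=1), W_new


new_texts = ["are there any blocks available", "amazon flex support was helpful"]
labels, weights = assign_topics(new_texts, vectorizer, clf)
for text, label, row in zip(new_texts, labels, weights):
    print(label, np.round(row, 3), text)
```

With the model from the question, the same two calls would be `clf.transform(vectorizer.transform(new_texts))`. The column index of the largest weight should correspond to the `Topic #k` numbering, assuming `print_top_words` lists the rows of `clf.components_` in order. A text whose weights are all zero shares no vocabulary with the training data, so it is probably better left unassigned than forced into topic 0.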