python - Scikit-learn NMF: removing duplicate words

Asked 2017-08-09T08:22:01.333

Viewed 493 times

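I am using scikit-learn's NMF algorithm to extract trending terms from a set of blog posts. For example, I get "game thrones" (the "of" is removed as a stop word, which is fine), but I also get "game" and "thrones" as separate terms. Likewise, I get "marcus hutchins" (good), but also "marcus" and "hutchins", which is bad. How can I prevent these duplicates? Here is what I have (the variable "documents" is a list containing the blog posts):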

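    from sklearn.decomposition import NMF
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Build TF-IDF features over unigrams, bigrams and trigrams
    # (no_features is set elsewhere; its value is not shown in the post)
    tfidf_vectorizer = TfidfVectorizer(max_features=no_features,
                                       stop_words='english', ngram_range=(1, 3),
                                       min_df=3, max_df=0.95)
    tfidf = tfidf_vectorizer.fit_transform(documents)
    # scikit-learn >= 1.0 offers get_feature_names_out() for this
    tfidf_feature_names = tfidf_vectorizer.get_feature_names()

    # no of topics to display
    no_topics = 5

    # Run NMF (scikit-learn >= 1.0 replaces alpha with alpha_W / alpha_H)
    nmf = NMF(n_components=no_topics, random_state=1, alpha=.1, l1_ratio=.5,
              init='nndsvd').fit(tfidf)

    # no of words to display for each topic
    no_top_words = 10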
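
One possible way to suppress the duplicates is to post-process each topic's top terms rather than change the vectorizer. The sketch below is an assumption, not something scikit-learn provides: the helper names `is_subphrase` and `top_terms_without_subphrases` and the `pool_factor` parameter are hypothetical. It drops any term whose words appear contiguously inside a longer term from the same candidate pool.

```python
def is_subphrase(short, long):
    """Return True if the words of `short` occur contiguously inside `long`."""
    s, l = short.split(), long.split()
    if len(s) >= len(l):
        return False
    return any(l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1))


def top_terms_without_subphrases(weights, feature_names, n_top, pool_factor=2):
    """Pick up to n_top terms for one topic, skipping sub-phrases.

    weights       -- one topic's weight per feature (hypothetical input,
                     e.g. one row of a fitted NMF model's components_)
    feature_names -- term for each feature index
    pool_factor   -- hypothetical knob: how many times n_top to consider
    """
    # Rank feature indices by descending topic weight.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    # Candidate pool, larger than n_top so dropped terms can be replaced.
    pool = [feature_names[i] for i in ranked[:n_top * pool_factor]]
    # Keep a term only if no other pooled term contains it as a sub-phrase.
    kept = [t for t in pool if not any(is_subphrase(t, other) for other in pool)]
    return kept[:n_top]


names = ["game", "game thrones", "thrones",
         "marcus hutchins", "marcus", "hutchins"]
weights = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
result = top_terms_without_subphrases(weights, names, n_top=3)
```

With the fitted model from the post, each row of `nmf.components_` could presumably be passed as `weights` together with `tfidf_feature_names`. One trade-off of this sketch: a unigram is dropped whenever a containing phrase is anywhere in the pool, even if that phrase ranks too low to make the final cut, so a small `pool_factor` may be preferable.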

0 answers