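I recently did some document clustering using LSA followed by KMeans. However, when I try to print the most important words in each cluster, I get very strange results: the words it prints don't even belong to that cluster.
Here is the code and the output: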
# ------------------- LSA transformation ------------------------
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
lsa = TruncatedSVD(n_components=7, n_iter=100)
# fit_transform fits the model and projects the TF-IDF matrix in one step
lsa_matrix = lsa.fit_transform(tv_matrix)
terms = tv.get_feature_names()
#--------------------- k means to create clusters -------------------
X = lsa_matrix
km = KMeans(n_clusters=7, random_state=0)
km.fit(X)
cluster_labels = km.labels_
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])
X_df = pd.DataFrame(X)
result = pd.concat([corpus_df, cluster_labels], axis = 1 )
#--------printing common words in each cluster-------
common_words = km.cluster_centers_.argsort()[:,-1:-11:-1]
for num, centroid in enumerate(common_words):
    print(str(num) + ' : ' + ', '.join(terms[word] for word in centroid))
#-------------------------------------------------------
However, the output is as follows:
0 : ability, ability basic, ability built, ability differentiate, ability add, ability control, ability find
1 : ability add, ability, ability differentiate, ability built, ability find, ability control, ability basic
2 : ability differentiate, ability, ability find, ability control, ability basic, ability add, ability built
3 : ability basic, ability, ability built, ability find, ability control, ability differentiate, ability add
4 : ability find, ability, ability basic, ability add, ability control, ability built, ability differentiate
5 : ability built, ability, ability find, ability control, ability differentiate, ability add, ability basic
6 : ability control, ability, ability add, ability basic, ability built, ability differentiate, ability find
The word "ability" doesn't even appear in most of these clusters. Can someone point out what I'm doing wrong?
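
One possible explanation, offered as a sketch and not a confirmed diagnosis: KMeans was fitted on the 7-dimensional LSA matrix, so km.cluster_centers_ has shape (7, 7) and its argsort can only return indices 0 to 6. Indexing terms with those values would always pick the first seven vocabulary entries, which matches the output above. Mapping the centroids back to term space before ranking may give meaningful words. The example below uses only NumPy with made-up toy arrays; top_terms_per_cluster, the toy terms, and all numeric values are hypothetical stand-ins for lsa.components_, km.cluster_centers_, and the real vocabulary.

```python
import numpy as np

def top_terms_per_cluster(centroids_lsa, components, terms, n_top=10):
    """Project centroids from LSA space back to term space, then rank terms."""
    # centroids_lsa: (n_clusters, n_components)
    # components:    (n_components, n_terms), e.g. lsa.components_
    centroids_terms = np.asarray(centroids_lsa) @ np.asarray(components)
    # Sort each row in descending order and keep the n_top term indices
    order = np.argsort(centroids_terms, axis=1)[:, ::-1][:, :n_top]
    return [[terms[i] for i in row] for row in order]

# Toy stand-ins (hypothetical values, not taken from the original data)
terms = ["ability", "battery", "camera", "screen", "price"]
components = np.array([
    [0.0, 0.9, 0.1, 0.0, 0.4],   # LSA component 0 loads on battery/price
    [0.0, 0.0, 0.8, 0.6, 0.1],   # LSA component 1 loads on camera/screen
])
centroids_lsa = np.array([
    [1.0, 0.0],                  # cluster 0 sits on component 0
    [0.0, 1.0],                  # cluster 1 sits on component 1
])

# Ranking in LSA space: indices are component numbers, never above 1 here,
# so they can only ever select the first two entries of `terms`.
naive = centroids_lsa.argsort()[:, ::-1]
naive_terms = [[terms[i] for i in row] for row in naive]

# Ranking after projecting back to term space
top = top_terms_per_cluster(centroids_lsa, components, terms, n_top=2)

for num, words in enumerate(top):
    print(str(num) + " : " + ", ".join(words))
```

With the objects from the question, the inputs would presumably be km.cluster_centers_ and lsa.components_; lsa.inverse_transform(km.cluster_centers_) computes the same matrix product. Note also that recent scikit-learn releases replaced get_feature_names() with get_feature_names_out().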