python - Document clustering and visualization

Question

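I want to test whether a set of documents shares some particular similarity by looking at a plot built from each document's vector representation, shown alongside the documents of another text dataset. My guess is that they will end up close together in the visualization.

Is the solution to use doc2vec to compute a vector for each document and plot it? Can this be done in an unsupervised way? Which Python library should I use to get those nice 2D and 3D representations people show for Word2vec?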

score 0 · Accepted Answer

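I'm not sure what you are asking, but if you want a way to check whether vectors belong to the same group, you can use K-Means. K-Means partitions a list of vectors into K clusters, so it can work if you choose a good K: not too low, so that it actually separates something, and not too high, so that it does not split the data too finely.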
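As a minimal sketch of that idea, assuming scikit-learn is installed: the corpus below is made up, TF-IDF vectors stand in for doc2vec vectors, and K=2 is picked by hand because the toy corpus has two obvious topics. PCA is only one of several ways to get 2D coordinates for a plot.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration: two documents about cats, two about finance.
docs = [
    "the cat sat on the mat and purred",
    "my cat chased the other cat around the mat",
    "stock markets fell as interest rates rose",
    "interest rates and stock prices moved the markets",
]

# Stand-in document vectors (TF-IDF). Doc2vec vectors would be used here instead.
vectors = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()

# Group the vectors into K clusters. K=2 is an assumption that suits this toy corpus.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(vectors)

# Project to 2D so each document can be drawn as a point coloured by its cluster.
coords = PCA(n_components=2).fit_transform(vectors)

for doc, label, (x, y) in zip(docs, labels, coords):
    print(f"cluster {label}  ({x:+.2f}, {y:+.2f})  {doc}")
```

If matplotlib is available, `plt.scatter(coords[:, 0], coords[:, 1], c=labels)` draws the picture; documents that share a cluster should show up near each other.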

This is roughly how K-Means works:

init_center(K)  # randomly pick K vectors to be the initial centers of your clusters

# This one is tricky, as there are many ways to check for convergence;
# the easiest is to check whether any center has moved since the last iteration.
while not converge():

    associate_vector()  # associate every vector with its closest center

    re_calculate_center()  # move each center to the mean of all the vectors in its cluster
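The steps above can be sketched as runnable Python using only the standard library. This is one possible reading of the pseudocode, not a tuned implementation: the `kmeans` function, its `max_iter` and `seed` parameters, and the sample points are my own assumptions for illustration.

```python
import math
import random


def kmeans(vectors, k, max_iter=100, seed=0):
    """Plain K-Means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # init_center(K): pick K distinct input vectors as the starting centers.
    centers = rng.sample(vectors, k)
    clusters = []

    for _ in range(max_iter):
        # associate_vector(): assign every vector to its closest center.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: math.dist(v, centers[i]))
            clusters[nearest].append(v)

        # re_calculate_center(): move each center to the mean of its cluster.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centers.append(mean)
            else:
                # An empty cluster keeps its previous center.
                new_centers.append(center)

        # converge(): stop once no center has moved since the last iteration.
        if new_centers == centers:
            break
        centers = new_centers

    return centers, clusters


# Two well-separated groups of made-up 2D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = kmeans(points, k=2)
for center, members in zip(centers, clusters):
    print(center, members)
```

Real document vectors have many more dimensions, but the loop is the same; only the distance computation gets more expensive.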

This article and the gif in it explain the algorithm much more clearly than I can, even though the article is about Java: https://picoledelimao.github.io/blog/2016/03/12/multithreaded-k-means-in-java/
