python - Arranging documents in a grid based on content similarity

Question

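How can I arrange documents in a space (for example, a set of grids) so that the position of each document carries information about how similar it is to the other documents? I looked into K-means clustering, but it is somewhat computationally expensive when the data is large. I am looking for something like hashing the documents' content so that they can be placed in a large space, where similar documents have similar hashes and the distance between them is small. That way it would be easy to find documents similar to a given document without much extra work.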

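The result might look something like the figure below. In this case, music files are close to movie files but far from computer-related files. The box can be thought of as the whole world of documents.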

[image]

Any help would be greatly appreciated.

Thanks

jvc007

score 4 · Accepted Answer

One way to introduce a distance or similarity measure between documents is:

First, encode your documents as vectors, e.g. using TF-IDF (see https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
The scalar product of the vectors for two documents then gives you a measure of their similarity: the larger the value, the more similar the documents. If the vectors are normalized to unit length, this is the cosine similarity.
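
The two steps above can be sketched in plain Python. This is only a minimal illustration, not a reference implementation: the whitespace tokenizer, the smoothed IDF formula, the helper names tfidf_vectors and similarity, and the three toy documents are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document, scaled to unit length."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF; this is one of several common variants (assumption).
        vec = {t: count * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
               for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def similarity(u, v):
    """Scalar product of two sparse vectors (cosine similarity for unit vectors)."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())


# Toy documents, invented for illustration.
docs = [
    "rock music album guitar",
    "movie soundtrack music film",
    "computer cpu memory disk",
]
vecs = tfidf_vectors(docs)
sim_music_movie = similarity(vecs[0], vecs[1])     # share the term "music"
sim_music_computer = similarity(vecs[0], vecs[2])  # share no terms
```

Here the music and movie documents share one term, so their similarity is positive, while the music and computer documents share none and get a similarity of zero.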

Applying MDS (http://en.wikipedia.org/wiki/Multidimensional_scaling) to these similarities (converted to dissimilarities, since MDS works on distances) should let you visualize the documents in a two-dimensional plot.
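
To make the MDS step concrete, here is a rough sketch of classical (Torgerson) MDS in plain Python. It is one variant of MDS, picked because it fits in a few lines, and not necessarily the one the answer has in mind. The function name classical_mds, the conversion distance = 1 - similarity, and the similarity values are assumptions for illustration; the eigenvectors are found by simple power iteration, which is fine for small inputs but not for large ones.

```python
import math
import random


def classical_mds(dist, n_components=2, n_iter=500, seed=0):
    """Embed points from a symmetric distance matrix (list of lists)."""
    n = len(dist)
    # Double-center the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]  # equals column mean by symmetry
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]

    def matvec(v):
        return [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * n_components for _ in range(n)]
    for k in range(n_components):
        # Power iteration for the dominant eigenvector of the current matrix.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        for _ in range(n_iter):
            w = matvec(v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        eigval = sum(x * y for x, y in zip(v, matvec(v)))  # Rayleigh quotient
        scale = math.sqrt(max(eigval, 0.0))  # negative eigenvalues are dropped
        for i in range(n):
            coords[i][k] = scale * v[i]
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]  # deflate the found component
    return coords


# Hypothetical pairwise similarities for three documents.
similarities = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
# Turn similarities into dissimilarities (assumed conversion: 1 - s).
distances = [[1.0 - s for s in row] for row in similarities]
coords = classical_mds(distances)  # one [x, y] pair per document
```

The resulting coordinates can be passed to any plotting library; documents 0 and 1 end up closer to each other than either is to document 2.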

score 2 · Accepted Answer

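The problem of mapping high-dimensional data to a low-dimensional space while preserving similarity can be addressed with a self-organizing map (SOM, or Kohonen network). I have seen some applications of it to documents.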

I don't know of any Python implementation (there may be one), but there is a good implementation for Matlab (the SOM Toolbox).
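
For readers who want to see the idea, a very small self-organizing map can be written in plain Python. This is only a sketch under assumptions: the function names train_som and best_matching_unit, the linear decay of learning rate and neighborhood radius, the Gaussian neighborhood, and the toy input vectors are all choices made for this example, not taken from the SOM Toolbox.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the (row, col) cell whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda cell: sum((w - xi) ** 2 for w, xi in zip(weights[cell], x)),
    )


def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=None, seed=0):
    """Train a rows x cols map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = sigma0 or max(rows, cols) / 2.0
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighborhood radius shrinks
        order = list(data)
        rng.shuffle(order)
        for x in order:
            br, bc = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                # Cells near the winner on the grid are pulled toward x too.
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights


# Toy document vectors (e.g. topic weights), invented for illustration.
data = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.1, 0.9],
]
weights = train_som(data, rows=3, cols=3)
cells = [best_matching_unit(weights, x) for x in data]  # grid cell per document
```

Each document is then placed in the grid cell returned by best_matching_unit, and documents with similar vectors tend to land in the same or neighboring cells.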

score 0 · Accepted Answer

I think what you are looking for is locality-sensitive hashing. See this answer for a nice graphical explanation and sample code.
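
As a rough illustration of the idea (and not the code from the linked answer), here is a sketch of one common LSH family, random-hyperplane hashing for cosine similarity. The names make_hyperplanes, lsh_signature and hamming, the 16-bit signature length, and the toy vectors are assumptions for this example.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_signature(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane does vec fall on?"""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming(a, b):
    """Number of positions where two signatures differ."""
    return sum(x != y for x, y in zip(a, b))


# Toy document vectors, invented for illustration.
music = [1.0, 0.2, 0.0]
music_scaled = [2.0, 0.4, 0.0]    # same direction, different length
opposite = [-1.0, -0.2, -0.0]     # exactly the opposite direction

planes = make_hyperplanes(dim=3, n_bits=16)
sig_music = lsh_signature(music, planes)
sig_scaled = lsh_signature(music_scaled, planes)
sig_opposite = lsh_signature(opposite, planes)
```

Vectors pointing in the same direction get identical signatures, and the expected number of differing bits grows with the angle between two vectors, so the signature (or a prefix of it) can serve as a bucket key for finding similar documents quickly.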
