
I need to compare two groups of documents (a group might contain, e.g., 1000 documents) and determine which document in the second group is most similar to a given document in the first group. So far I have used TF/IDF with cosine similarity, but I need something faster that is about as accurate as TF/IDF :) Can you suggest a faster algorithm, or a way to reduce the running time of TF/IDF?


1 Answer


It depends on the kind of differences you want to match. The fastest method I know of is shingling combined with MinHash: http://www.stanford.edu/~ashishg/amdm/handouts/scribed-lec10.pdf http://en.wikipedia.org/wiki/MinHash

It is meant for finding near or exact duplicate documents, not documents that are only partially similar.
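To make the idea concrete, here is a minimal sketch of shingling plus MinHash in plain Python. It is only an illustration under assumptions that are not part of the question: character 5-shingles, 128 hash functions of the form (a*h + b) mod p, and the helper names `shingles`, `make_hash_params`, `minhash_signature`, `estimate_jaccard` and `most_similar` are all made up for this example. It compares signatures by brute force; for large collections you would typically add locality-sensitive hashing on top.

```python
import hashlib
import random

# Mersenne prime used as the modulus of the illustrative hash family.
PRIME = (1 << 61) - 1


def shingles(text, k=5):
    """Return the set of character k-shingles of a lowercased,
    whitespace-normalized text (k=5 is an arbitrary choice)."""
    text = " ".join(text.lower().split())
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _base_hash(shingle):
    """Deterministic 64-bit hash of a shingle, reduced modulo PRIME."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % PRIME


def make_hash_params(num_hashes=128, seed=0):
    """Draw (a, b) pairs defining the functions h -> (a*h + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(shingle_set, params):
    """For each hash function, keep the minimum value over all shingles."""
    base = [_base_hash(s) for s in shingle_set]
    return [min((a * h + b) % PRIME for h in base) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


def most_similar(query, candidates, params, k=5):
    """Return the index of the candidate whose signature best matches the query."""
    query_sig = minhash_signature(shingles(query, k), params)
    scores = [
        estimate_jaccard(query_sig, minhash_signature(shingles(doc, k), params))
        for doc in candidates
    ]
    return max(range(len(candidates)), key=scores.__getitem__)
```

In practice you would compute the signatures of the second group once and reuse them for every document in the first group, so each comparison costs only as many integer comparisons as there are hash functions, regardless of document length.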

Answered 2013-07-09T16:22:37.540