algorithm - Designing a similarity table

Question

I can't think of a better way to solve the following problem. Imagine I have a large table whose rows and columns are some kind of ids, let's say book ids:

book_id-->1    2     3     .....
  1       1   0.92    0.33
  2
  3

Each entry in this table tells you how similar two books are. So from the table above, book 1 and book 2 have a similarity index of 0.92.

So, I have already computed this on the backend for, let's say, "n" entries.

From entry n+1 onward, the data arrives in real time.

So the first step I have to do is fill in the new row. This is a very naive approach:

 for (i = 0; i < total_books; i++)
    sim(book(n+1), book(i))

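Let's say that computing the similarity of any two books is very fast. But since this has to happen "n" times, it adds up.

If there are "m" new books, then it is an n^2 operation (I think). Is there a better algorithm or data structure that would make this computation acceptable?

Also, just to add some background: this similarity is nothing more than a dot product between two vectors (googling "cosine similarity" will give you the idea). There is nothing fancy about it; you just take the dot product of two vectors and it returns a value between 0 and 1.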
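To make the setup concrete, here is a minimal sketch of filling in the row for one new book. It assumes each book is represented by a feature vector with non-negative components (which is what keeps the cosine similarity between 0 and 1) and that the stored vectors are normalized once up front, so each comparison is a plain dot product. The names `normalize` and `similarity_row` are made up for illustration and are not from the question.

```python
import math

def normalize(vec):
    """Scale vec to unit length (assumes vec is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_row(new_vec, existing_unit_vecs):
    """Dot product of the new book's unit vector with every stored unit vector."""
    u = normalize(new_vec)
    return [sum(a * b for a, b in zip(u, v)) for v in existing_unit_vecs]

# Hypothetical data: three existing books, normalized once ahead of time.
books = [normalize(v) for v in ([1.0, 0.0], [0.0, 2.0], [3.0, 3.0])]

# One pass over the n existing books produces the new row.
row = similarity_row([2.0, 0.0], books)
```

Normalizing ahead of time means the per-pair work is only the dot product, but the row still costs one comparison per existing book.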

score 0 · Accepted Answer

When you add 1 book to a collection of n books, it performs n operations. When you add m books to a collection of n books, it performs n + (n+1) + ... + (n+m-1) operations, which is n*m + (1 + 2 + ... + (m-1)) = n*m + m*(m-1)/2, so it is O(n*m + m*m).
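The count can be checked with a short sketch that simulates the naive insertion loop and compares it with the closed form n*m + m*(m-1)/2, which is what the sum 1 + 2 + ... + (m-1) works out to. The function names are hypothetical.

```python
def count_sim_calls(n, m):
    """Count sim() calls when m books are added one by one to n existing books."""
    calls = 0
    size = n
    for _ in range(m):
        calls += size   # the new book is compared with every book already stored
        size += 1       # the collection grows by one
    return calls

def closed_form(n, m):
    """n*m comparisons against the original books plus m*(m-1)/2 among the new ones."""
    return n * m + m * (m - 1) // 2

simulated = count_sim_calls(1000, 10)
predicted = closed_form(1000, 10)
```

When m is small compared with n, the n*m term dominates, so the cost is roughly linear in n for each new book.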

If you have implemented your solution in a naive way, you can halve the computation time by computing and storing sim(book_i, book_j) only when id(book_i) < id(book_j) (this does not change the complexity). Then, when you want to retrieve sim(i, j), you just have to make sure that you pass the arguments in the correct order.
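One way to sketch this idea is a small wrapper that always stores a pair under the key (smaller id, larger id), so callers never need to think about argument order. The class name `SimilarityTable` and the choice of a dictionary are assumptions for illustration; returning 1.0 for a book compared with itself assumes cosine similarity on non-zero vectors.

```python
class SimilarityTable:
    """Stores each symmetric similarity once, keyed by (smaller id, larger id)."""

    def __init__(self):
        self._scores = {}

    @staticmethod
    def _key(i, j):
        # Put the ids in a canonical order so (i, j) and (j, i) share one entry.
        return (i, j) if i < j else (j, i)

    def set(self, i, j, score):
        self._scores[self._key(i, j)] = score

    def get(self, i, j):
        if i == j:
            return 1.0  # assumption: a book is fully similar to itself
        return self._scores[self._key(i, j)]

table = SimilarityTable()
table.set(2, 1, 0.92)          # stored under the key (1, 2)
forward = table.get(1, 2)
backward = table.get(2, 1)
```

This stores roughly half as many entries as a full n by n table, at the cost of one comparison per lookup to order the ids.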
