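I am trying to compute the similarity between documents in a large, dynamic collection of text documents. For a static set, something like cosine similarity + tf-idf would work well. However, I am looking for a scheme that lets me add new documents without recomputing the entire set of similarities. Does such an algorithm exist?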
You seem close to a solution already. Store the result of f(document) for each document, then combine the stored results.
Map each document to its term frequencies and store them:
d0: "the" : 70, "quick" : 22, "fox" : 1
d1: "the" : 42, "lazy" : 2, "dog" : 13
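As a minimal sketch of this mapping step, assuming Python and a naive whitespace tokenizer (the function name `term_frequencies` is hypothetical, not something from the answer), the per-document counts could be built with `collections.Counter`:

```python
from collections import Counter


def term_frequencies(text):
    """Return a Counter mapping each lowercased token to its count."""
    # Assumption: splitting on whitespace is a good enough tokenizer here;
    # a real pipeline would likely strip punctuation and stop words too.
    return Counter(text.lower().split())


# Each document is processed once and its counts are stored for later reuse.
stored = {
    "d0": term_frequencies("the quick fox saw the dog"),
    "d1": term_frequencies("the lazy dog"),
}
```

Because the counts are stored per document, adding a new document only requires one more call to the function; the existing entries are untouched.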
Merge the documents and evaluate over the aggregate:
d0_d1: "the" : 112, "lazy" : 2, "dog" : 13, "quick" : 22, "fox" : 1
tf_idf(d0_d1)
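One possible way to put the store-and-combine idea into practice is sketched below, assuming Python. The class `IncrementalIndex` and its method names are hypothetical, and the idf formula used, log(N / df), is one common variant rather than anything the answer specifies. Adding a document only updates the document-frequency table; tf-idf weights and cosine similarity are derived on demand from the stored counts.

```python
import math
from collections import Counter


class IncrementalIndex:
    """Hypothetical helper: stores raw counts, derives tf-idf lazily."""

    def __init__(self):
        self.term_counts = {}      # doc id -> Counter of term frequencies
        self.doc_freq = Counter()  # term -> number of documents containing it

    def add(self, doc_id, counts):
        """Store a new document without touching existing entries."""
        if doc_id in self.term_counts:
            raise ValueError(f"document {doc_id!r} already indexed")
        counts = Counter(counts)
        self.term_counts[doc_id] = counts
        self.doc_freq.update(counts.keys())  # each term counted once per doc

    def merged(self, *doc_ids):
        """Combine stored counts, as in the d0_d1 aggregate above."""
        total = Counter()
        for doc_id in doc_ids:
            total.update(self.term_counts[doc_id])
        return total

    def tf_idf(self, doc_id):
        """Weights use the current corpus size, so they stay up to date."""
        n_docs = len(self.term_counts)
        return {
            term: count * math.log(n_docs / self.doc_freq[term])
            for term, count in self.term_counts[doc_id].items()
        }

    def cosine(self, a, b):
        va, vb = self.tf_idf(a), self.tf_idf(b)
        dot = sum(w * vb.get(term, 0.0) for term, w in va.items())
        norm_a = math.sqrt(sum(w * w for w in va.values()))
        norm_b = math.sqrt(sum(w * w for w in vb.values()))
        if norm_a == 0.0 or norm_b == 0.0:
            return 0.0
        return dot / (norm_a * norm_b)


index = IncrementalIndex()
index.add("d0", {"the": 70, "quick": 22, "fox": 1})
index.add("d1", {"the": 42, "lazy": 2, "dog": 13})
combined = index.merged("d0", "d1")
```

The trade-off in this sketch is that similarities are not cached: because idf values shift whenever the corpus grows, each pairwise score is recomputed from the stored counts when requested, which keeps insertion cheap at the cost of some work at query time.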