
I have around 20k documents of 60-150 words each. For 400 of these documents, the similar documents are known. These 400 documents serve as my test data.

I am trying to find similar documents for these 400 documents using gensim doc2vec. The paper "Distributed Representations of Sentences and Documents" says that "The combination of PV-DM and PV-DBOW often work consistently better (7.42% in IMDB) and therefore recommended."

So I would like to combine the vectors from these two methods, compute the cosine similarity against all the training documents, and select the 5 with the smallest cosine distance.

So what is an effective way to combine the vectors from these 2 methods: adding, averaging, or some other method?

After combining these 2 vectors I can normalise each vector and then find the cosine distance.


1 Answer


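The paper implies that they concatenated the vectors from the two methods. For example, given a 300d PV-DBOW vector and a 300d PV-DM vector, you would have a 600d vector for a text after concatenation.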
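As a rough illustration of that concatenation, followed by the normalise-then-cosine step from the question, here is a minimal NumPy sketch. The random arrays only stand in for real PV-DBOW and PV-DM document vectors, and the function names (`concatenate_vectors`, `l2_normalize`, `top_k_similar`) are made up for this example; they are not part of gensim.

```python
import numpy as np

def concatenate_vectors(dbow, dm):
    """Join per-document PV-DBOW and PV-DM vectors side by side."""
    return np.hstack([dbow, dm])

def l2_normalize(matrix):
    """Scale every row to unit length (guarding against zero rows)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)

def top_k_similar(query, corpus, k=5):
    """Return (row index, cosine similarity) for the k nearest rows."""
    unit_corpus = l2_normalize(corpus)
    unit_query = l2_normalize(query[None, :])[0]
    sims = unit_corpus @ unit_query          # dot product of unit vectors = cosine
    order = np.argsort(-sims)[:k]            # highest similarity first
    return [(int(i), float(sims[i])) for i in order]

# Placeholder data: 1000 documents, 300d per model (assumed sizes).
rng = np.random.default_rng(0)
dbow = rng.normal(size=(1000, 300))
dm = rng.normal(size=(1000, 300))

combined = concatenate_vectors(dbow, dm)     # shape (1000, 600)
neighbours = top_k_similar(combined[42], combined, k=5)
```

One thing to keep in mind: the cosine similarity of concatenated vectors weights each half by its norm, so if the two models produce vectors of very different magnitudes, one option is to L2-normalise each half before concatenating. Whether that helps on a given dataset is something to check against the 400 test documents.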

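However, note that their bottom-line results on IMDB have been hard for outsiders to reproduce. My tests have only sometimes shown a small advantage for these concatenated vectors. (I especially wonder whether 300d PV-DBOW + 300d PV-DM via separately trained, concatenated models is any better than just training a true 600d model for the same amount of time, with fewer steps/complications.)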

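You can see a demo of some of my experiments repeating parts of the original "Paragraph Vector" paper in one of the example notebooks included with gensim in its docs/notebooks directory: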

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb

Among other things, it includes some steps and helpful methods for treating a pair of models as a concatenated whole.

Answered 2019-08-06T20:22:26.560