
I am new to word/paragraph embeddings and trying to understand them via doc2vec in gensim. I would like to seek advice on whether my understanding is correct. My understanding is that doc2vec can potentially return documents with semantically similar content. As a test, I tried the following and have the questions below.

Question 1: I noted that every training run with the exact same parameters and examples results in a model that produces very different results from previous runs (e.g. different vectors and a different ranking of similar documents every time). Why is it so non-deterministic? Can it be relied upon for any practical work?

Question 2: Why am I getting words back instead of the tag IDs of the top similar documents? Results: [('day', 0.477), ('2016', 0.386), ...]

Question 2 answer: The problem was that I called model.most_similar; I should have used model.docvecs.most_similar instead.

Please advise if I have misunderstood anything.

Data prep

I created multiple documents, each one sentence long, and deliberately made them semantically distinct from one another.

A: It is a fine summer weather, with the birds singing and sun shining bright.

B: It is a lovely day indeed, if only i had a degree in appreciating.

C: 2016-2017 Degree in Earth Science Earthly University

D: 2009-2010 Dip in Life and Nature Life College

Query: Degree in Philosophy from Thinking University from 2009 to 2010

Training

I trained on the documents (tokens as words, a running index as the tag):

import gensim
from gensim.models.doc2vec import TaggedDocument
from gensim.utils import tokenize  # stand-in here for my tokenizer

tdlist=[]
docstring=['It is a fine summer weather, with the birds singing and sun shining bright.',
           'It is a lovely day indeed, if only i had a degree in appreciating.',
           '2016-2017 Degree in Earth Science Earthly University',
           '2009-2010 Dip in Life and Nature Life College']
counter=1
for para in docstring:
   tokens=tokenize(para) #This will also strip punctuation
   td=TaggedDocument(gensim.utils.to_unicode(str.encode(' '.join(tokens))).split(), str(counter))
   tdlist.append(td)
   counter=counter+1

model=gensim.models.Doc2Vec(tdlist,dm=0,alpha=0.025, size=20, min_alpha=0.025, min_count=0)
for epoch in range(200):
    model.train(tdlist, total_examples=model.corpus_count, epochs=model.iter)

Inference

I then attempted to infer a vector for the query. Although many of the query's words are missing from the vocabulary, I would expect the closest similarity results to be documents C and D. But the results only gave me a list of 'words', each followed by a similarity score. I am unsure whether my understanding is wrong. Below is my code extract.

mydocvector=model.infer_vector(['Degree' ,'in' ,'Philosophy' ,'from' ,'Thinking' ,'University', 'from', '2009', 'to', '2010'])
print(model.docvecs.most_similar(positive=[mydocvector]))

1 Answer


Doc2Vec doesn't work well on toy-sized datasets – few documents, few total words, few words per document. You'll definitely want more documents than vector dimensions (size), ideally tens of thousands of documents or more.

The second argument to TaggedDocument should be a list of tags. By supplying a single string of an integer, each of its elements (characters) will be seen as a tag. (It won't hurt with just documents 1 to 4, but as soon as you have a document 10, Doc2Vec will see it as the tags 1 and 0, unless you supply it as ['10'], a single-element list.)
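
As a sketch of that fix (reusing the question's loop and the same gensim-era API, with the tags built as one-element lists):

from gensim.models.doc2vec import TaggedDocument
from gensim.utils import tokenize  # stand-in for the question's tokenizer

tdlist = []
for i, para in enumerate(docstring, start=1):
    # tags is a LIST of tags: ['10'] stays one tag, while the string '10'
    # would have been treated as the two tags '1' and '0'
    tdlist.append(TaggedDocument(words=list(tokenize(para)), tags=[str(i)]))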

Yes, to find the most-similar documents you use model.docvecs.most_similar() rather than model.most_similar() (which only operates on learned words, if any).

You are using dm=0 mode, which is quite a good idea to start with – it's fast and often a top performer. But note that this mode also doesn't train word-vectors. So anything you ask of the top-level model, like model['summer'] or model.most_similar('sun'), will be nonsense results based on randomly-initialized but never-trained words. (If you need words trained too, either add dbow_words=1 to the dm=0 mode, or use dm=1 mode. But for pure doc-vectors, dm=0 is a good choice.)
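
For example (a sketch using the question's parameters; dbow_words=1 is the real gensim option mentioned above, and min_count=0 is kept only so the toy corpus survives):

# DBOW doc-vectors plus interleaved skip-gram word training
model = gensim.models.Doc2Vec(tdlist, dm=0, dbow_words=1, size=20, min_count=0, iter=20)
# only with dbow_words=1 (or dm=1) do word lookups like model['summer'] mean anything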

There's no need to call train() in a loop – or indeed at all, given the line above it. The form you've used to instantiate Doc2Vec, with the actual corpus tdlist as the first argument, already triggers model setup and training, using the default number of passes (iter=5) and the supplied alpha and min_alpha. Now, for Doc2Vec training you'll usually want more passes (10 to 20 are common, though smaller datasets may benefit from even more). And for any training, for proper gradient descent, you want the effective learning rate alpha to gradually decline to a negligible value, such as the default min_alpha=0.0001 (rather than being forced to stay at the starting value).
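
In other words, the entire 200-iteration loop can be replaced by one instantiation that asks for more passes (a sketch reusing the question's parameter names):

# vocab discovery plus 20 training passes happen inside this one call;
# alpha decays from its 0.025 default toward min_alpha=0.0001 automatically
model = gensim.models.Doc2Vec(tdlist, dm=0, size=20, iter=20, min_count=0)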

The only situation where you'd typically call train() explicitly is if you instantiate the model without a corpus. In that case, you'd need to call model.build_vocab(tdlist) (to let the model initialize with the discovered vocabulary), and then some form of train() – but you'd still need only one call to train, supplying the desired number of passes. (Allowing the default model.iter of 5 passes inside an outer loop of 200 iterations means 1,000 total passes over the data... and all at the same fixed alpha, which is not proper gradient descent.)
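
The explicit two-step form, if you do want it, looks roughly like this sketch:

model = gensim.models.Doc2Vec(dm=0, size=20, iter=20, min_count=0)  # no corpus yet
model.build_vocab(tdlist)                                           # scan corpus, build vocabulary
model.train(tdlist, total_examples=model.corpus_count, epochs=model.iter)  # ONE call, all passes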

When you have a beefier dataset, you may find results improve with a higher min_count. Words that appear only a few times usually can't contribute much meaning, and so only serve as noise that slows training and interferes with other vectors becoming more expressive. (Don't assume "more words must mean better results".) Discarding singletons, or even more, usually helps.
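
A tiny illustration of the effect (hypothetical mini-corpus `big_docs`; on the question's toy data a higher min_count would discard nearly everything):

# min_count trims rare words from the learned vocabulary
big_docs = [TaggedDocument(words=['degree', 'in', 'science'], tags=['0']),
            TaggedDocument(words=['degree', 'in', 'arts'], tags=['1'])]
model = gensim.models.Doc2Vec(dm=0, size=20, min_count=2)
model.build_vocab(big_docs)
print(sorted(model.wv.vocab.keys()))  # only words seen >= 2 times survive: ['degree', 'in']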

Regarding inference, almost none of the words in your inference text are in the training set. (I only see 'Degree', 'in', and 'University' repeated.) So, in addition to all the issues above, inferring a good vector for the example text would be hard. With a richer training set, you'd likely get better results. It also often helps to increase the steps optional parameter to infer_vector() far above its default of 5.
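
For example (a sketch; steps is an existing infer_vector() parameter in this gensim version, and 50 is just an illustrative value):

query = ['Degree', 'in', 'Philosophy', 'from', 'Thinking',
         'University', 'from', '2009', 'to', '2010']
vec = model.infer_vector(query, steps=50)           # default steps is only 5
print(model.docvecs.most_similar(positive=[vec]))   # (tag, cosine-similarity) pairs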

answered 2017-06-16T19:05:44.007