python - How do I resolve a gensim KeyError when I try to get a document's vector?

Question

I am using the following code to train a doc2vec model. Each document is defined as the text/line between these two lines:

clueweb09-en0001-XX-XXXXX
end_clueweb09-en0001-XX-XXXXX

This is my code:

import os

import gensim
from gensim.models.doc2vec import Doc2Vec, LabeledSentence

path='/home/work/Step2/test-input/html'


alldocs = []  # will hold all docs in original order


for fname in os.listdir(path):
    with open(path+'/'+fname) as alldata:
        for line in alldata:
            docId= line
            print docId
            context= alldata.next()
            #print context
            tokens = gensim.utils.to_unicode(context).split()
            end=alldata.next()
            alldocs.append(LabeledSentence(tokens[:],[docId]))

model = Doc2Vec(alpha=0.025, min_alpha=0.025)  # use fixed learning rate
model.build_vocab(alldocs)
for epoch in range(10):
    model.train(alldocs)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay

# store the model to mmap-able files
model.save(path+'/my_html_model.doc2vec')

But when I write model.docvecs['clueweb09-en0001-01-34238'] I get an error, whereas model.docvecs[0] returns a result.

This is the error I get:

Traceback (most recent call last):
  File "getLearingDoc.py", line 40, in <module>
    print model.docvecs['clueweb09-en0001-01-34238']
  File "/home/flashkar/anaconda/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 341, in __getitem__
    return self.doctag_syn0[self._int_index(index)]
  File "/home/flashkar/anaconda/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 315, in _int_index
    return self.max_rawint + 1 + self.doctags[index].offset
KeyError: 'clueweb09-en0001-01-34238'

I have no experience with Python or gensim; please tell me how to solve this problem.

score 0 · Accepted Answer

Are you sure that a tag exactly equal to 'clueweb09-en0001-01-34238' was supplied during training, with no stray newline or other extra characters?
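In the question's loop, `docId = line` keeps the trailing newline that comes with every line read from a file, so the tag that gets stored is probably 'clueweb09-en0001-01-34238\n' rather than the string used for the lookup. Below is a minimal sketch of reading the same three-line layout with the ID stripped. The `read_docs` helper and the sample lines are hypothetical, and a plain dict stands in for the model's tag lookup so that the snippet runs without gensim.

```python
def read_docs(lines):
    """Yield (doc_id, tokens) pairs from lines laid out as:
    ID line, text line, end-marker line, repeated."""
    it = iter(lines)
    for id_line in it:
        doc_id = id_line.strip()        # drop the trailing newline and spaces
        tokens = next(it, "").split()   # the document text on the next line
        next(it, "")                    # skip the end_... marker line
        yield doc_id, tokens


# Hypothetical sample data in the same layout as the question's files.
sample = [
    "clueweb09-en0001-01-34238\n",
    "some example page text\n",
    "end_clueweb09-en0001-01-34238\n",
]

raw_tag = sample[0]                     # what `docId = line` would store
docs = list(read_docs(sample))

# A plain dict used here as a stand-in for the model's tag lookup.
tags = {doc_id: i for i, (doc_id, _) in enumerate(docs)}

print(repr(raw_tag))                    # the unstripped tag ends in a newline
print("clueweb09-en0001-01-34238" in tags)
```

In the original loop the equivalent change would be `docId = line.strip()` before building the `LabeledSentence`. The model would then presumably have to be rebuilt and retrained, since the tags are collected when the corpus is scanned.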

You can see all the string doctags known to the model in the keys of the dict model.docvecs.doctags, or in the list model.docvecs.offset2doctag.
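Printing those tags with `repr` makes hidden whitespace visible. The sketch below uses a hypothetical helper, `find_near_matches`, to list the known tags that differ from the wanted one only by surrounding whitespace. The `known_tags` list is made-up data standing in for model.docvecs.offset2doctag, so the snippet runs without a trained model.

```python
def find_near_matches(known_tags, wanted):
    """Return the known tags that equal `wanted` once surrounding
    whitespace is stripped from both sides."""
    target = wanted.strip()
    return [tag for tag in known_tags if tag.strip() == target]


# Made-up stand-in for model.docvecs.offset2doctag.
known_tags = [
    "clueweb09-en0001-01-34238\n",
    "clueweb09-en0001-01-34239\n",
]

matches = find_near_matches(known_tags, "clueweb09-en0001-01-34238")
for tag in matches:
    print(repr(tag))   # repr shows the trailing \n explicitly
```

If a match comes back with a trailing `\n`, the lookup key and the stored tag differ only by that newline, which would explain the KeyError.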
