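I have a word2vec model in gensim, trained on 98892 documents. For any given sentence that is not in the sentences array (i.e. the set I trained the model on), I need to update the model with that sentence so that the next query for it returns some results. I am doing it like this: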
new_sentence = ['moscow', 'weather', 'cold']
model.train(new_sentence)
and this is what gets printed in the logs:
2014-03-01 16:46:58,061 : INFO : training model with 1 workers on 98892 vocabulary and 100 features
2014-03-01 16:46:58,211 : INFO : reached the end of input; waiting to finish 1 outstanding jobs
2014-03-01 16:46:58,235 : INFO : training on 10 words took 0.1s, 174 words/s
Now, when I query for the most similar words using the same new_sentence as the positive examples (as model.most_similar(positive=new_sentence)), it gives an error:
Traceback (most recent call last):
  File "<pyshell#220>", line 1, in <module>
    model.most_similar(positive=['moscow', 'weather', 'cold'])
  File "/Library/Python/2.7/site-packages/gensim/models/word2vec.py", line 405, in most_similar
    raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word 'cold' not in vocabulary"
This indicates that the word 'cold' is not part of the vocabulary the model was trained on (am I right?).
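A minimal sketch of how that reading could be checked before calling most_similar. The `vocab` set below is a made-up stand-in for the model's vocabulary, and `split_by_vocabulary` is a hypothetical helper, not part of gensim. With a real model the membership test would presumably run against the model's own vocabulary mapping instead (I assume `model.vocab` in this gensim version, but that is an assumption).

```python
def split_by_vocabulary(words, vocab):
    """Split words into those present in vocab and those missing from it.

    `vocab` can be any container supporting `in` (a set, a dict, ...).
    """
    known = [w for w in words if w in vocab]
    missing = [w for w in words if w not in vocab]
    return known, missing


# Hypothetical stand-in for the trained model's vocabulary.
vocab = {"moscow", "weather", "snow"}

new_sentence = ['moscow', 'weather', 'cold']
known, missing = split_by_vocabulary(new_sentence, vocab)

print("in vocabulary:", known)
print("not in vocabulary:", missing)
```

If `missing` is non-empty, querying with only the `known` words would at least avoid the KeyError, although it does not add the new words to the model.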
So the question is: how do I update the model so that it gives all possible similarities for a given new sentence?