gensim - Updating a gensim word2vec model

Question

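I have a word2vec model in gensim, trained over 98892 documents. For any given sentence that is not present in the sentences array (i.e. the set I trained the model on), I need to update the model with that sentence so that the next query returns some results. I am doing it like this: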

new_sentence = ['moscow', 'weather', 'cold']
model.train(new_sentence)

and it prints this as the log:

2014-03-01 16:46:58,061 : INFO : training model with 1 workers on 98892 vocabulary and 100 features
2014-03-01 16:46:58,211 : INFO : reached the end of input; waiting to finish 1 outstanding jobs
2014-03-01 16:46:58,235 : INFO : training on 10 words took 0.1s, 174 words/s

Now, when I query for the most similar words with the same new_sentence (as model.most_similar(positive=new_sentence)), it gives this error:

Traceback (most recent call last):
  File "<pyshell#220>", line 1, in <module>
    model.most_similar(positive=['moscow', 'weather', 'cold'])
  File "/Library/Python/2.7/site-packages/gensim/models/word2vec.py", line 405, in most_similar
    raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word 'cold' not in vocabulary"

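Which indicates that the word 'cold' is not part of the vocabulary I trained the model on (am I right)?

So the question is: how do I update the model so that it gives all the possible similarities for the given new sentence?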

score 25 · Accepted Answer

train() expects a sequence of sentences on input, not one sentence.
train() only updates weights for existing feature vectors based on existing vocabulary. You cannot add new vocabulary (=new feature vectors) using train().
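
To illustrate the first point: a sentence is a list of word tokens, so a sequence of sentences is a list of such lists. The sketch below uses plain Python and no gensim; it only shows which tokens a consumer of a sentence corpus would see in each case. The helper name tokens_seen is made up for illustration.

```python
def tokens_seen(sentences):
    """Return the tokens seen when iterating over `sentences` and then
    over each sentence, the way a corpus of token lists is consumed."""
    return [token for sentence in sentences for token in sentence]


new_sentence = ['moscow', 'weather', 'cold']

# Passing the bare sentence: each word is taken as a "sentence"
# and each character as a "word".
print(tokens_seen(new_sentence)[:6])   # ['m', 'o', 's', 'c', 'o', 'w']

# Wrapping it in a list gives a one-sentence corpus with the intended tokens.
print(tokens_seen([new_sentence]))     # ['moscow', 'weather', 'cold']
```

So the call in the question would need at least model.train([new_sentence]), and even then only words already in the vocabulary would be updated.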

score 24 · Accepted Answer

As of gensim 0.13.3, it is possible to do online training of Word2Vec with gensim.

model.build_vocab(new_sentences, update=True)
model.train(new_sentences)

score 8 · Accepted Answer

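If your model was generated by the C tool and loaded with load_word2vec_format(), it is not possible to update that model. See the online training section of the word2vec tutorial:

Note that it's not possible to resume training with models generated by the C tool, load_word2vec_format(). You can still use them for querying/similarity, but information vital for training (the vocab tree) is missing there.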

score 2 · Accepted Answer

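First of all, you cannot add new words to a pre-trained model.

However, there is a "new" doc2vec model, published in 2014, which meets all your requirements. You can use it to train document vectors instead of getting a set of word vectors and then combining them. The best part is that doc2vec can infer vectors for unseen sentences after training. The model itself still cannot be changed, but in my experiments the inference results were pretty good.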

score 2 · Accepted Answer

The problem is that you cannot retrain a word2vec model with new sentences. Only doc2vec allows it. Try a doc2vec model.

score 1 · Accepted Answer

You can add to the model vocabulary, and add to the embeddings, using FastText.

from gensim.models import FastText

Here you can see some FastText examples. Here you can see how to score out-of-vocabulary (OOV) instances with FastText.
