python - How to add missing word vectors to the pre-trained GoogleNews-vectors-negative300.bin model?

Question

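I am using the gensim word2vec library in Python with the pre-trained GoogleNews-vectors-negative300.bin model. However, my corpus contains words for which the model has no word vectors, so I get a KeyError. How can I solve this problem?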

Here is what I have tried so far:

1: Load the pre-trained `GoogleNews-vectors-negative300.bin` model:

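from gensim.models import Word2Vec

# Newer gensim versions expose this loader as KeyedVectors.load_word2vec_format
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
print("model loaded...")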

2: Build word vectors for the training set by averaging the vectors of all words in each tweet, then scale them:

import numpy as np

def buildWordVector(text, size):
    # Sum the vectors of every word the model knows, then average them
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in text:
        try:
            vec += model[word].reshape((1, size))
            count += 1.
            # print("found! ", word)
        except KeyError:
            # The pre-trained model has no vector for this word
            print("not found! ", word)  # missing words
            continue
    if count != 0:
        vec /= count
    return vec

trained_vecs = np.concatenate([buildWordVector(z, n_dim) for z in x_train])

Please tell me how to add new words to a pre-trained Word2vec model.

score 1 · Accepted Answer

Edit 2019/06/07

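As pointed out by @Oleg Melnikov and https://rare-technologies.com/word2vec-tutorial/#online_training__resuming, it is not possible to resume training without the vocabulary tree (which is not saved when training finishes in the C code):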

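Note that it is not possible to resume training with models generated by the C tool and loaded with load_word2vec_format(). You can still use them for querying/similarity, but information vital for training (the vocab tree) is missing there.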
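Since the loaded vectors can still be queried, one way to avoid the KeyError is to check membership before each lookup and collect the words that are missing. The following is only a sketch under assumptions: `toy_vectors` is a hypothetical dict standing in for the loaded model, and `average_known_words` is an illustrative helper name, not part of gensim. The same `word in model` check should work on vectors loaded with gensim.

```python
import numpy as np

# Hypothetical stand-in for the loaded vectors: maps word -> 1-D vector.
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([3.0, 2.0, 0.0]),
}

def average_known_words(tokens, vectors, size):
    """Average the vectors of tokens found in `vectors`; also return the missing tokens."""
    known = [vectors[t] for t in tokens if t in vectors]
    missing = [t for t in tokens if t not in vectors]
    if not known:
        # No token has a vector: fall back to an all-zero row
        return np.zeros((1, size)), missing
    return np.mean(known, axis=0).reshape((1, size)), missing

vec, missing = average_known_words(["good", "movie", "zzzqqq"], toy_vectors, 3)
```

Here `missing` lists the out-of-vocabulary tokens, which makes it easy to see how much of the corpus the pre-trained model does not cover.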

Get pre-trained vectors, e.g. Google News
Load the model in gensim
Continue training the model in gensim

These commands might come in handy:

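from gensim.models import Word2Vec

# Loading pre-trained vectors
# (newer gensim versions expose this loader as KeyedVectors.load_word2vec_format)
model = Word2Vec.load_word2vec_format('/tmp/vectors.bin', binary=True)

# Training the model with list of sentences (with 4 CPU cores)
# Per the edit above, vectors loaded this way lack the vocab tree, so training cannot be resumed
model.train(sentences, workers=4)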
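Because training cannot be resumed on vectors loaded this way, a possible workaround (not part of the answer above, and a different technique from continued training) is to give each missing word a placeholder vector. The sketch below assumes a plain dict `toy_vectors` as a hypothetical stand-in for the loaded vectors. `fill_missing_vectors` and the uniform range of ±0.25 are illustrative choices, not gensim defaults.

```python
import numpy as np

def fill_missing_vectors(vocab, vectors, size, seed=0):
    """Return a copy of `vectors` with a random vector added for every missing word."""
    rng = np.random.default_rng(seed)
    filled = dict(vectors)  # shallow copy; the original mapping is left untouched
    for word in vocab:
        if word not in filled:
            # Arbitrary small range, chosen only for illustration
            filled[word] = rng.uniform(-0.25, 0.25, size)
    return filled

# Hypothetical stand-in for the loaded vectors
toy_vectors = {"good": np.array([1.0, 0.0, 2.0])}
filled = fill_missing_vectors(["good", "zzzqqq"], toy_vectors, size=3)
```

Random placeholders carry no semantic information, so they mainly keep lookups from failing. Recent gensim versions also provide methods on KeyedVectors for adding vectors (add_vector and add_vectors in gensim 4), which could replace the dict here; check the documentation for the installed version.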
