python - Gensim Doc2Vec model only generates a limited number of vectors

Question

I am using the gensim Doc2Vec model to generate my feature vectors. Here is the code I am using (I have explained my problem in the code comments):

import multiprocessing

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

cores = multiprocessing.cpu_count()

# creating a list of tagged documents
training_docs = []

# all_docs: a list of 53 strings which are my documents and are very long (not just a couple of sentences)
for index, doc in enumerate(all_docs):
    # 'doc' is in unicode format and I have already preprocessed it
    training_docs.append(TaggedDocument(doc.split(), str(index+1)))

# at this point, I have 53 strings in my 'training_docs' list 

model = Doc2Vec(training_docs, size=400, window=8, min_count=1, workers=cores)

# now that I print the vectors, I only have 10 vectors while I should have 53 vectors for the 53 documents that I have in my training_docs list.
print(len(model.docvecs))
# output: 10

I am just wondering whether I am doing something wrong, or whether there is another parameter I should set?

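UPDATE: I was experimenting with the tags parameter of TaggedDocument, and when I changed it to a mix of text and numbers, such as Doc1, Doc2, ..., I saw a different count of generated vectors, but I still do not get the expected number of feature vectors.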

score 1 · Accepted Answer

Look at the actual tags it has discovered in your corpus:

print(model.docvecs.offset2doctag)

Do you see a pattern?

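The tags property of each document should be a list of tags, not a single tag. If you supply a plain string of digits, it is treated as a list of characters, one tag per digit, so the model only learns the tags '0', '1', ..., '9'.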

You can replace str(index+1) with [str(index+1)] and get the behavior you expected.
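The effect can be reproduced without gensim. The following is a minimal sketch, assuming only that the tags value is consumed by iterating over it; the variable names are illustrative and not part of any library.

```python
# Pure-Python sketch: no gensim needed to see why only 10 tags survive.
num_docs = 53  # the document count from the question

# Buggy form: each tags value is a bare string, so iterating yields characters.
tags_from_strings = set()
for index in range(num_docs):
    tags = str(index + 1)            # e.g. "12"
    tags_from_strings.update(tags)   # adds "1" and "2" as separate tags

# Fixed form: each tag is wrapped in a list, so iterating yields the whole string.
tags_from_lists = set()
for index in range(num_docs):
    tags = [str(index + 1)]          # e.g. ["12"]
    tags_from_lists.update(tags)     # adds "12" as a single tag

print(sorted(tags_from_strings))
print(len(tags_from_strings), len(tags_from_lists))  # prints: 10 53
```

By the same reasoning, bare-string tags such as Doc1, Doc2, ... would break into the characters 'D', 'o', 'c' plus the ten digits, which would explain why the count changed in the update but still did not reach 53.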

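However, since your document IDs are just ascending integers, you can also use plain Python ints as your doc tags. This saves some memory by avoiding the creation of a lookup dict from string tags to array slots (ints). To do this, replace str(index+1) with [index]. (This starts the doc IDs from 0, which is a bit more Pythonic and also avoids wasting an unused 0 position in the raw array that holds the trained vectors.)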
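The memory argument can be illustrated with a simplified model of the bookkeeping. This is a sketch under the assumption that int tags are used directly as row indexes into the vector array, while string tags need a separate mapping; it does not reproduce gensim's actual internals, and all names are hypothetical.

```python
# Simplified model of tag-to-slot bookkeeping; not gensim's actual internals.
num_docs = 53

# String tags: a dict is needed to map each tag to a row in the vector array.
string_tags = [str(index + 1) for index in range(num_docs)]
tag_to_slot = {tag: slot for slot, tag in enumerate(string_tags)}

# Int tags: the tag itself is the row index, so the array needs max(tag) + 1 rows.
one_based_tags = [index + 1 for index in range(num_docs)]
zero_based_tags = list(range(num_docs))

rows_one_based = max(one_based_tags) + 1    # row 0 is allocated but never used
rows_zero_based = max(zero_based_tags) + 1  # every row belongs to a document

print(len(tag_to_slot), rows_one_based, rows_zero_based)  # prints: 53 54 53
```

With 0-based int tags there is no lookup dict to maintain and no spare row, which is the saving described above.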
