python - Gensim Doc2Vec - Passing corpus sentences to the Doc2Vec function

Question

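I use the MySentences class below to extract sentences from all the files in a directory, and I use those sentences to train a word2vec model. My dataset is unlabeled.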

import os
import gensim

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

sentences = MySentences('sentences')
model = gensim.models.Word2Vec(sentences)

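Now I want to use that class to build a doc2vec model. I read the Doc2Vec reference page. The Doc2Vec() function takes sentences as a parameter, but it does not accept the sentences variable above and raises an error: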

AttributeError: 'list' object has no attribute 'words'

What is the problem? What is the correct type for that parameter?

Update:

I think the unlabeled data is the problem. It seems doc2vec needs labeled data.

score 2 · Accepted Answer

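There is no need for an extra class to solve this. A recent update of the library added TaggedLineDocument, which reads a file with one sentence per line and turns each line into the tagged document that Doc2Vec expects.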

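from gensim.models.doc2vec import TaggedLineDocument

# INPUT_FILE: path to a text file with one sentence per line
sentences = TaggedLineDocument(INPUT_FILE)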
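
A rough sketch of what this amounts to, which may also explain the AttributeError in the question: Doc2Vec reads a `words` and a `tags` attribute from every item, and a plain list has neither. The `TaggedDoc` namedtuple and the `tagged_lines` helper below are hypothetical stand-ins written in pure Python, not gensim's own classes; they only mimic the idea of one tagged document per line, tagged with its line number.

```python
from collections import namedtuple

# Hypothetical stand-in with the same two fields as gensim's TaggedDocument.
TaggedDoc = namedtuple("TaggedDoc", "words tags")


def tagged_lines(lines):
    """Yield one tagged document per line, tagged with its line number."""
    for line_no, line in enumerate(lines):
        yield TaggedDoc(words=line.split(), tags=[line_no])


docs = list(tagged_lines(["the first sentence", "a second one"]))
for doc in docs:
    print(doc.tags, doc.words)
```

Each item now carries both the tokens and a tag, which is the shape the training code needs.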

Then train the model:

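from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(alpha=0.025, min_alpha=0.025)
model.build_vocab(sentences)

for epoch in range(10):
    # Recent gensim versions require total_examples and epochs here.
    model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
    model.alpha -= 0.002           # lower the learning rate manually
    model.min_alpha = model.alpha  # keep it constant within each pass
    print(epoch)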
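
To make the effect of the manual decay concrete, here is a small sketch that only does the arithmetic of the loop above, without gensim. The helper name `manual_alpha_schedule` is made up for illustration.

```python
def manual_alpha_schedule(start_alpha, step, epochs):
    """Return the learning rate used in each pass of a manual decay loop."""
    alphas = []
    alpha = start_alpha
    for _ in range(epochs):
        alphas.append(round(alpha, 6))  # rate used for this pass
        alpha -= step                   # lowered before the next pass
    return alphas


schedule = manual_alpha_schedule(0.025, 0.002, 10)
print(schedule)
```

With the values from the answer, the first pass runs at 0.025 and the tenth at about 0.007.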

score 0 · Accepted Answer

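Unlike word2vec, doc2vec needs every training entry to be tagged with a unique id. This is necessary because when it later predicts similarities, its results are doc ids (the unique ids of the training entries), just as words are what word2vec predicts.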
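
For unlabeled data like the question's, the ids can simply be generated. A minimal sketch, assuming one sentence per line in every file of the directory: `TaggedSentences` is a hypothetical variant of the question's MySentences class, and the `filename_lineno` tag scheme is just one possible choice. The namedtuple fallback is a stand-in with the same fields as gensim's TaggedDocument so the sketch also runs without gensim installed.

```python
import os
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same fields, so the sketch runs without gensim.
    TaggedDocument = namedtuple("TaggedDocument", "words tags")


class TaggedSentences(object):
    """Hypothetical variant of MySentences that tags every line."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in sorted(os.listdir(self.dirname)):
            with open(os.path.join(self.dirname, fname)) as fp:
                for line_no, line in enumerate(fp):
                    # File name plus line number is unique for every line.
                    tag = "%s_%d" % (fname, line_no)
                    yield TaggedDocument(line.split(), [tag])
```

Because the class defines `__iter__`, it can be iterated more than once, which matters when the vocabulary is built in one pass and training happens in later passes.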

Here is a piece of my code that does exactly what you are trying to achieve:

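from gensim.models.doc2vec import TaggedDocument

class DynamicCorpus(object):
    def __iter__(self):
        # csv_file: path to the id:text file, defined elsewhere
        with open(csv_file) as fp:
            for line in fp:
                # split on the first ':' only, so the text may contain colons
                doc_id, text = line.rstrip('\n').split(':', 1)
                yield TaggedDocument(text.split(), [doc_id])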

My csv file has the format id:text.

Later you can feed the corpus to the model:

from gensim.models import doc2vec

corpus = DynamicCorpus()

d2v = doc2vec.Doc2Vec(min_count=15,
                      window=10,
                      vector_size=300,
                      workers=15,
                      alpha=0.025,
                      min_alpha=0.00025,
                      dm=1)
d2v.build_vocab(corpus)

# training_iterations: number of outer passes, defined elsewhere
for epoch in range(training_iterations):
    # d2v.epochs replaces the older d2v.iter attribute
    d2v.train(corpus, total_examples=d2v.corpus_count, epochs=d2v.epochs)
    d2v.alpha -= 0.0002
    d2v.min_alpha = d2v.alpha
