gensim - Problem accessing docvectors with gensim

Question

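I am trying to use gensim (ver 1.0.1) doc2vec to get the cosine similarity of documents. This should be relatively simple, but I am having trouble retrieving the documents' vectors so that I can compute the cosine similarity. When I try to retrieve a document by the label I gave it during training, I get a KeyError.

For example, print(model.docvecs['4_99.txt']) tells me there is no such key as 4_99.txt.

However, if I print(model.docvecs.doctags), I see entries like this: '4_99.txt_3': Doctag(offset=1644, word_count=12, doc_count=1)

So it appears that, for each document, doc2vec is saving each sentence under "document name, underscore, number".

So I am either A) training incorrectly, or B) not understanding how to retrieve the document vector so that I can call similarity(d1, d2).

Can anyone help me out here?

Here is how I trained my doc2vec: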

import os
from gensim.models.doc2vec import Doc2Vec

# Obtain txt abstracts and txt patents
filedir = os.path.abspath(os.path.join(os.path.dirname(__file__)))
files = os.listdir(filedir)

# Doc2Vec takes [['a', 'sentence'], 'and label']
docLabels = [f for f in files if f.endswith('.txt')]

sources = {}  # {'2_139.txt': '2_139.txt'}
for label in docLabels:
    sources[label] = label
sentences = LabeledLineSentence(sources)


model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(sentences.to_array())
for epoch in range(10):
    model.train(sentences.sentences_perm())

model.save('./a2v.d2v')

This uses the following class:

from random import shuffle
from gensim import utils
from gensim.models.doc2vec import LabeledSentence

class LabeledLineSentence(object):

    def __init__(self, sources):
        self.sources = sources

        flipped = {}

        # make sure that keys are unique
        for key, value in sources.items():
            if value not in flipped:
                flipped[value] = [key]
            else:
                raise Exception('Non-unique prefix encountered')

    def __iter__(self):
        # one LabeledSentence per line, tagged '<prefix>_<line number>'
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    yield LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no])

    def to_array(self):
        self.sentences = []
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    self.sentences.append(LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no]))
        return self.sentences

    def sentences_perm(self):
        shuffle(self.sentences)
        return self.sentences

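I got this class from a web tutorial (https://medium.com/@klintcho/doc2vec-tutorial-using-gensim-ab3ac03d3a1) to help me get around Doc2Vec's odd data-format requirements, and to be honest I don't fully understand it. It looks like the class as written here appends _n for each sentence, but in the tutorial they still seem to retrieve the document vector by giving just the filename... so what am I doing wrong here?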

score 2 · Accepted Answer

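The gensim Doc2Vec class uses the document "tags" you pass it during training as the keys for the doc-vectors.

Yes, the LabeledLineSentence class is appending _n to the document tags. Specifically, these appear to be line numbers within the associated files.

So, if what you really want is per-line vectors, you have to request them using the same keys that were provided during training, including the _n suffix.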
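As a rough sketch of what "the same keys" means in practice, the helper below lists the per-line tags that belong to one file. The function name line_tags_for_file and the sample tags are made up for illustration; a plain dict stands in for model.docvecs.doctags so the snippet runs without a trained model.

```python
def line_tags_for_file(doctags, filename):
    """Return the per-line tags for `filename`, ordered by line number."""
    prefix = filename + "_"
    matches = [
        tag for tag in doctags
        if tag.startswith(prefix) and tag[len(prefix):].isdigit()
    ]
    return sorted(matches, key=lambda tag: int(tag[len(prefix):]))


# Hypothetical stand-in for model.docvecs.doctags (only the keys matter here).
doctags = {
    "4_99.txt_0": None,
    "4_99.txt_3": None,
    "4_99.txt_10": None,
    "4_9.txt_0": None,
}
tags = line_tags_for_file(doctags, "4_99.txt")

# With a model trained as in the question, each tag is a valid key, e.g.:
#   vec = model.docvecs["4_99.txt_3"]
#   sim = model.docvecs.similarity("4_99.txt_3", "4_99.txt_0")
```

Each returned tag identifies a single line of the file, so this gives per-line vectors, not one vector for the whole document.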

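If you instead want each file to be its own document, you will need to change the corpus class to use the whole file as a single document. Looking at the tutorial you reference, it appears to have a second LabeledLineSentence class that is not line-oriented (though it is still named that way), but you are not using that variant.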
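A minimal sketch of a file-per-document corpus might look like the following. This is not the tutorial's class: the name FilePerDocumentCorpus, the UTF-8 assumption, and the demo files are all illustrative. It uses TaggedDocument (the newer name for LabeledSentence) and falls back to an equivalent namedtuple only so the snippet runs without gensim installed.

```python
import os
import tempfile
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same two fields, used only when gensim is absent.
    TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


class FilePerDocumentCorpus(object):
    """Yield one TaggedDocument per file, tagged with the bare filename."""

    def __init__(self, directory, filenames):
        self.directory = directory
        self.filenames = list(filenames)

    def __iter__(self):
        for name in self.filenames:
            path = os.path.join(self.directory, name)
            # Assumes UTF-8 text; the whole file becomes one token list.
            with open(path, encoding="utf-8") as fin:
                words = fin.read().split()
            yield TaggedDocument(words=words, tags=[name])


# Tiny demonstration on throwaway files (made-up names and contents).
demo_dir = tempfile.mkdtemp()
samples = [("4_99.txt", "first line\nsecond line\n"), ("2_139.txt", "another file\n")]
for name, text in samples:
    with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as fout:
        fout.write(text)

docs = list(FilePerDocumentCorpus(demo_dir, ["4_99.txt", "2_139.txt"]))
```

Because the tag is now just the filename, a model trained on such a corpus would be expected to resolve a lookup like model.docvecs['4_99.txt'].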

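Separately, you don't need to loop and call train() multiple times while manually adjusting alpha. In any recent version of gensim, where train() already iterates over the corpus multiple times, doing so almost certainly won't do what you intend. In the latest versions of gensim, calling it that way will even raise an error, because many outdated examples online have encouraged this mistake.

Just call train() once; it will iterate over your corpus the number of times specified when the model was constructed. (The default is 5, but it can be controlled with the iter initialization parameter, and 10 or more is common with Doc2Vec corpora.)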
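A single-call version of the training code might look like the sketch below. It is written against the gensim 1.0.x API used in the question (size, iter, and train() taking only the corpus); later releases renamed these parameters and changed the train() signature, so check the documentation for your installed version. The function name train_once and the passes argument are illustrative.

```python
def train_once(documents, passes=10):
    """Build and train a Doc2Vec model with one train() call (gensim 1.0.x API).

    `documents` must be re-iterable (for example a list of tagged documents),
    because the corpus is scanned once for the vocabulary and again on
    every training pass.
    """
    # Imported here so the sketch can be loaded without gensim installed.
    from gensim.models import Doc2Vec

    # `iter` sets how many passes train() makes over the corpus.
    model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4,
                    negative=5, workers=8, iter=passes)
    model.build_vocab(documents)
    model.train(documents)  # one call; no outer epoch loop
    return model
```

With the question's corpus class, this would be called as train_once(sentences.to_array()).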
