python - Using LSI with gensim in Python

Question

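I'm using Python's gensim library for latent semantic indexing. I followed the tutorials on the website and everything works well. Now I'm trying to modify it a bit: I want to run the LSI model every time a document is added.

Here is my code: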

from gensim import corpora, models

stoplist = set('for a of the and to in'.split())
num_factors = 3
corpus = []

for i in range(len(urls)):
    print "Importing", urls[i]
    doc = getwords(urls[i])
    cleandoc = [word for word in doc.lower().split() if word not in stoplist]
    # Create the dictionary from the first document, then extend it
    if i == 0:
        dictionary = corpora.Dictionary([cleandoc])
    else:
        dictionary.addDocuments([cleandoc])
    newVec = dictionary.doc2bow(cleandoc)
    corpus.append(newVec)
    # Retrain tf-idf and LSI on the corpus collected so far
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    lsi = models.LsiModel(corpus_tfidf, numTopics=num_factors, id2word=dictionary)
    corpus_lsi = lsi[corpus_tfidf]

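getwords is a function I wrote that returns the contents of a website as a string. Again, it works if I wait until all the documents have been processed before doing the tfidf and lsi steps, but that's not what I want. I want to do it on every iteration. Unfortunately, I get this error: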

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "streamlsa.py", line 51, in <module>
    lsi = models.LsiModel(corpus_tfidf, numTopics=num_factors, id2word=dictionary)
  File "/Library/Python/2.6/site-packages/gensim-0.7.8-py2.6.egg/gensim/models/lsimodel.py", line 303, in __init__
    self.addDocuments(corpus)
  File "/Library/Python/2.6/site-packages/gensim-0.7.8-py2.6.egg/gensim/models/lsimodel.py", line 365, in addDocuments
    self.printTopics(5) # TODO see if printDebug works and remove one of these..
  File "/Library/Python/2.6/site-packages/gensim-0.7.8-py2.6.egg/gensim/models/lsimodel.py", line 441, in printTopics
    self.printTopic(i, topN = numWords)))
  File "/Library/Python/2.6/site-packages/gensim-0.7.8-py2.6.egg/gensim/models/lsimodel.py", line 433, in printTopic
    return ' + '.join(['%.3f*"%s"' % (1.0 * c[val] / norm, self.id2word[val]) for val in most])
  File "/Library/Python/2.6/site-packages/gensim-0.7.8-py2.6.egg/gensim/corpora/dictionary.py", line 52, in __getitem__
    return self.id2token[tokenid] # will throw for non-existent ids
KeyError: 1248

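The error usually shows up on the second document. I think I understand what it's telling me (the dictionary index is bad), I just can't figure out why. I've tried a lot of different things, but nothing seems to work. Does anyone know what's going on?

Thanks!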

score 4 · Accepted Answer

This was a bug in gensim: the reverse id->word mapping gets cached, but the cache was not updated after addDocuments().

It was fixed in a 2011 commit: https://github.com/piskvorky/gensim/commit/b88225cfda8570557d3c72b0820fefb48064a049
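
To make the failure mode concrete, here is a minimal sketch of a stale reverse-mapping cache. ToyDictionary and its refresh_cache flag are hypothetical stand-ins invented for illustration; this is not gensim's actual code, only the general pattern described above.

```python
class ToyDictionary:
    """Hypothetical stand-in for a token dictionary (not gensim's implementation)."""

    def __init__(self):
        self.token2id = {}    # forward mapping: token -> id
        self._id2token = {}   # cached reverse mapping: id -> token

    def add_documents(self, documents, refresh_cache=True):
        # Assign the next free id to every token not seen before.
        for doc in documents:
            for token in doc:
                if token not in self.token2id:
                    self.token2id[token] = len(self.token2id)
        if refresh_cache:
            # Invalidate the cache so it is rebuilt on the next lookup.
            self._id2token = {}

    def __getitem__(self, token_id):
        # Build the reverse mapping lazily, only when the cache is empty.
        if not self._id2token:
            self._id2token = {i: t for t, i in self.token2id.items()}
        return self._id2token[token_id]


# Buggy pattern: a lookup fills the cache, then new tokens are added
# without invalidating it.
buggy = ToyDictionary()
buggy.add_documents([["human", "computer"]])
buggy[0]                                    # lookup populates the cache
buggy.add_documents([["graph"]], refresh_cache=False)
try:
    buggy[2]
    stale_lookup_failed = False
except KeyError:
    stale_lookup_failed = True              # the new id is missing from the cache

# Fixed pattern: the cache is invalidated whenever tokens are added.
fixed = ToyDictionary()
fixed.add_documents([["human", "computer"]])
fixed[0]
fixed.add_documents([["graph"]])
recovered_token = fixed[2]
```

This would also explain why the error tends to appear on the second document: the first lookup fills the cache, and every id added afterwards is missing from it.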

score 1 · Accepted Answer

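OK, so I found a solution, although not an optimal one.

If you create a dictionary with corpora.Dictionary and then immediately add documents with dictionary.addDocuments, everything works fine.

However, if you use the dictionary between those two calls (by calling dictionary.doc2bow, or by attaching the dictionary to an lsi model with id2word), the dictionary becomes "frozen" and can no longer be updated. You can call dictionary.addDocuments and it will tell you that it has been updated, and it will even tell you how big the new dictionary is, for example: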

INFO:dictionary:built Dictionary(6627 unique tokens) from 8 documents (total 24054 corpus positions)

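However, as soon as you reference any of the new indices, you get an error. I'm not sure whether this is a bug or intended (for whatever reason), but at the very least, the fact that gensim reports successfully adding the documents to the dictionary is surely a bug.

First, I tried putting the dictionary calls in separate functions, where only a local copy of the dictionary should be modified. Well, it still broke. That seems strange to me, and I don't know why.

My next step was to pass copies of the dictionary made with copy.copy. This works, but it obviously adds overhead. It does, however, let you maintain a working copy of your corpus and dictionary. The biggest drawback for me, though, is that this solution doesn't let me use filterTokens to remove words that appear only once in the corpus, because that requires modifying the dictionary.

My other solution is simply to rebuild everything (corpus, dictionary, lsi and tfidf models) on every iteration. With my small sample dataset this gives slightly better results, but it won't scale to very large datasets without running into memory problems. Still, for now, this is what I'm doing.

If any experienced gensim users have a better (and more memory-friendly) solution, so that I don't run into problems with larger datasets, please let me know!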

score 0 · Accepted Answer

In doc2bow you can set allow_update=True, and it will automatically update your dictionary on each call to doc2bow.

http://radimrehurek.com/gensim/corpora/dictionary.html
