
I am using this gensim tutorial to find similarities between texts. Here is the code:

from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

'''
documents = ["Human machine interface for lab abc computer applications",
              "bags loose tea water second ingredient tastes water",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey",
              "red cow butter oil"]
'''
documents = ["Human machine interface for lab abc computer applications",
              "bags loose tea water second ingredient tastes water"]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once]
         for text in texts]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

#print corpus

tfidf = models.TfidfModel(corpus)

#print tfidf

corpus_tfidf = tfidf[corpus]

#print corpus_tfidf

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
lsi.print_topics(1)

lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2)
lda.print_topics(1)

corpora.MmCorpus.serialize('dict.mm', corpus)
corpus = corpora.MmCorpus('dict.mm')
#print corpus

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
doc = "human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
#print vec_lsi

index = similarities.MatrixSimilarity(lsi[corpus])
index.save('dict.index')
index = similarities.MatrixSimilarity.load('dict.index')

sims = index[vec_lsi]
#print list(enumerate(sims))

sims = sorted(enumerate(sims),key=lambda item: -item[1])
for sim in sims:
  print documents[sim[0]], " ==> ", sim[1]

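There are two document lists here. One has 10 texts and the other has 2 texts; the 10-text list is commented out. If I use the first document list, everything works and produces meaningful output. If I use the second document list (with 2 texts), I get the following message: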

/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:122: UserWarning: indices array has non-integer dtype (float64)
% self.indices.dtype.name )

What is the reason behind this error, and how can I fix it? I am using a 64-bit machine.


2 Answers


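This is probably because your second list is reduced to [[], ['water', 'water']] once you remove the singletons: an empty document and one containing only 'water'. Trying to do matrix operations on matrices with dimensions of 0 and 1 can cause all sorts of problems.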
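To see how the two-document list collapses, the preprocessing from the question can be replayed without gensim. This is only a sketch: it uses `collections.Counter` in place of the question's `all_tokens.count(...)` approach, which should give the same result for this input.

```python
from collections import Counter

# The two-document list from the question.
documents = ["Human machine interface for lab abc computer applications",
             "bags loose tea water second ingredient tastes water"]

# Remove common words and tokenize, as in the question.
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# Count each token across the whole corpus, then drop tokens seen only once.
counts = Counter(token for text in texts for token in text)
texts = [[word for word in text if counts[word] > 1] for text in texts]

print(texts)  # [[], ['water', 'water']]
```

Every word in the first document occurs exactly once in the whole corpus, so the first document ends up empty; only 'water', which occurs twice in the second document, survives.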

Playing around with your code:

>>> corpus = [dictionary.doc2bow(text) for text in texts]
>>> corpus
[[], [(0, 2)]]
>>> tfidf = models.TfidfModel(corpus)
2013-07-21 09:23:31,415 : INFO : collecting document frequencies
2013-07-21 09:23:31,415 : INFO : PROGRESS: processing document #0
2013-07-21 09:23:31,415 : INFO : calculating IDF weights for 2 documents and 1 features (1 matrix non-zeros)
>>> corpus = [[(1,)], [(0,2)]]
>>> tfidf = models.TfidfModel(corpus)
2013-07-21 09:24:16,452 : INFO : collecting document frequencies
2013-07-21 09:24:16,452 : INFO : PROGRESS: processing document #0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/tfidfmodel.py", line 96, in __init__
    self.initialize(corpus)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/tfidfmodel.py", line 119, in initialize
    for termid, _ in bow:
ValueError: need more than 1 value to unpack
>>> corpus = [[(1,3)], [(0,2)]]
>>> tfidf = models.TfidfModel(corpus)
2013-07-21 09:24:26,892 : INFO : collecting document frequencies
2013-07-21 09:24:26,892 : INFO : PROGRESS: processing document #0
2013-07-21 09:24:26,892 : INFO : calculating IDF weights for 2 documents and 2 features (2 matrix non-zeros)
>>> 

As I said above, you need to make sure that corpus does not contain any empty lists before calling models.TfidfModel(corpus).
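One way to follow that advice is to filter the bag-of-words corpus before building any model. The sketch below is plain Python; `drop_empty_docs` is a hypothetical helper name, not a gensim function, and it also returns the original positions so that results can still be mapped back to `documents`.

```python
def drop_empty_docs(corpus):
    """Hypothetical helper (not part of gensim).

    Returns the indices of the non-empty bag-of-words documents and the
    filtered corpus, in the same order as the input.
    """
    kept = [(i, bow) for i, bow in enumerate(corpus) if len(bow) > 0]
    indices = [i for i, _ in kept]
    filtered = [bow for _, bow in kept]
    return indices, filtered


corpus = [[], [(0, 2)]]  # the corpus printed in the session above
indices, filtered = drop_empty_docs(corpus)
print(indices)   # [1]
print(filtered)  # [[(0, 2)]]
```

The filtered list can then be passed on in place of `corpus`, and `indices[k]` gives the position in the original `documents` list of the k-th remaining document.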

Answered 2013-07-20T19:25:28.127

This is not an error but a warning. You can ignore it.

In the second case, your query document doc is empty, which causes the warning. In any case, you still get the correct answer (= an empty vec_lsi).
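The empty query can be checked for before it reaches the model. The sketch below does not call gensim: `token2id` is a plain dict standing in for the dictionary built from the two-document list, and `to_bow` is a hypothetical, simplified imitation of what `dictionary.doc2bow` returns (sorted `(token id, count)` pairs for known tokens only).

```python
# Stand-in for the gensim dictionary: with the two-document list,
# 'water' is the only token that survives preprocessing.
token2id = {'water': 0}


def to_bow(tokens, token2id):
    """Hypothetical doc2bow-like conversion; unknown tokens are ignored."""
    counts = {}
    for token in tokens:
        if token in token2id:
            tid = token2id[token]
            counts[tid] = counts.get(tid, 0) + 1
    return sorted(counts.items())


doc = "human computer interaction"
vec_bow = to_bow(doc.lower().split(), token2id)

if not vec_bow:
    # None of the query words are in the dictionary, so there is nothing to compare.
    print("query shares no words with the dictionary; skipping similarity lookup")
```

None of "human", "computer", or "interaction" is in the one-word dictionary, so the bag-of-words vector for the query is empty, which matches the explanation above.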

Answered 2013-12-04T22:44:49.327