I am following this gensim tutorial to find similarities between texts. Here is the code:
from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
'''
documents = ["Human machine interface for lab abc computer applications",
             "bags loose tea water second ingredient tastes water",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey",
             "red cow butter oil"]
'''
documents = ["Human machine interface for lab abc computer applications",
             "bags loose tea water second ingredient tastes water"]
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once]
         for text in texts]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
#print corpus
tfidf = models.TfidfModel(corpus)
#print tfidf
corpus_tfidf = tfidf[corpus]
#print corpus_tfidf
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
lsi.print_topics(1)
lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2)
lda.print_topics(1)
corpora.MmCorpus.serialize('dict.mm', corpus)
corpus = corpora.MmCorpus('dict.mm')
#print corpus
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
doc = "human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
#print vec_lsi
index = similarities.MatrixSimilarity(lsi[corpus])
index.save('dict.index')
index = similarities.MatrixSimilarity.load('dict.index')
sims = index[vec_lsi]
#print list(enumerate(sims))
sims = sorted(enumerate(sims),key=lambda item: -item[1])
for sim in sims:
    print documents[sim[0]], " ==> ", sim[1]
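There are two document lists here: one with 10 texts and one with 2 texts. The 10-text list is currently commented out. If I use the 10-text list, everything works and the output is meaningful. If I use the 2-text list, I get an error. This is all it prints: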
/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:122: UserWarning: indices array has non-integer dtype (float64)
% self.indices.dtype.name )
What is the reason behind this error, and how can I fix it? I am on a 64-bit machine.
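
One thing that may be worth checking is what the "remove words that appear only once" step leaves behind for the 2-text list. The sketch below repeats only the tokenizing and filtering from the code above, using the standard library and no gensim. The names `counts` and `filtered` are introduced here for illustration, and `Counter` stands in for the `all_tokens.count(word)` approach. Whether an emptied document is what actually triggers the scipy warning is an assumption, not something this check confirms.

```python
from collections import Counter

documents = ["Human machine interface for lab abc computer applications",
             "bags loose tea water second ingredient tastes water"]
stoplist = set('for a of the and to in'.split())

# Same tokenization as in the question: lowercase, split, drop stop words
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# Count every token across the whole corpus, then keep only tokens seen more than once
counts = Counter(word for text in texts for word in text)
filtered = [[word for word in text if counts[word] > 1] for text in texts]

# Show what each document looks like after filtering
for document, tokens in zip(documents, filtered):
    print("%r -> %r" % (document, tokens))
```

With these two texts, the only token that occurs more than once is "water", so the first document ends up with no tokens at all and `doc2bow` would return an empty list for it. NumPy gives an array built from an empty list a float64 dtype by default, which would be consistent with the dtype named in the warning. If that is the cause, skipping the once-only filter for very small corpora, or dropping documents that become empty before building the corpus, might be enough to avoid it.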