python - Computing TF-IDF similarity between 2 documents with Gensim

Question

I am using Gensim to compute the similarity between 2 documents. For some reason, the tfidf[corpus] line returns an empty list, and I don't know why.

    from gensim import corpora, models

    # WikiDoc and sorted_links are defined elsewhere (not shown)
    articles = []
    # make a corpus by adding each of the top 25 documents to a list
    for x in range(0, 25):
        articles.append(str(WikiDoc(sorted_links[0]).jsonify()['text']))
    # tokenize each document: lowercase, then split on whitespace
    texts = [[word for word in document.lower().split()] for document in articles]
    print(texts)
    # build the dictionary, save it, then load it back
    articles_dict = corpora.Dictionary(texts)
    articles_dict.save('./articles.dict')
    articles_dict = corpora.Dictionary.load('./articles.dict')
    # convert each document to bag-of-words and round-trip it through disk
    corpus = [articles_dict.doc2bow(text) for text in texts]
    corpora.MmCorpus.serialize('./articles.mm', corpus)
    corpus = corpora.MmCorpus('./articles.mm')
    # build the tfidf model based on the 25 documents so that we can find similarities
    # with respect to each of these documents
    tfidf = models.TfidfModel(corpus)
    # get the other document and convert it to its bag-of-words representation
    one_doc_bow = WikiDoc('SpongeBob')
    one_doc_bow = articles_dict.doc2bow(one_doc_bow.jsonify()['text'].lower().split())
    print(tfidf[one_doc_bow])
    top = tfidf[one_doc_bow]
    corpus_tfidf = tfidf[corpus]

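When I print the dictionary, I get: Dictionary(2204 unique tokens). When I print the MmCorpus, I get: MmCorpus(25 documents, 2204 features, 55100 non-zero entries). tfidf[corpus] yields []. Can anyone diagnose my problem? Many thanks!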
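
One possible reading of the numbers above, offered as a guess and not a confirmed diagnosis: 25 documents × 2204 features is exactly 55100, so every token seems to occur in every document. That is what would happen if the loop fetches `sorted_links[0]` on all 25 iterations instead of `sorted_links[x]`. As far as I know, Gensim's default weighting uses idf = log2(N / df) and drops entries whose weight is zero, so identical documents would give every term an IDF of 0 and leave each vector empty. The sketch below reproduces that effect in plain Python, not with Gensim itself. The function `tfidf_vectors` and the sample sentences are hypothetical stand-ins, not part of the original code.

```python
import math
from collections import Counter

def tfidf_vectors(texts, eps=1e-12):
    """Toy TF-IDF: weight = tf * log2(N / df); zero weights dropped; L2-normalised."""
    n_docs = len(texts)
    # document frequency: number of documents that contain each token
    df = Counter(token for text in texts for token in set(text))
    vectors = []
    for text in texts:
        tf = Counter(text)
        weights = {tok: cnt * math.log2(n_docs / df[tok]) for tok, cnt in tf.items()}
        # drop (near-)zero weights so the vector stays sparse
        weights = {tok: w for tok, w in weights.items() if abs(w) > eps}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({tok: w / norm for tok, w in weights.items()} if norm else {})
    return vectors

# Hypothetical stand-ins for the fetched articles.
same_doc = "spongebob lives in a pineapple".split()
identical = [same_doc] * 25   # mimics fetching sorted_links[0] 25 times
distinct = [same_doc, "patrick lives under a rock".split()]

print(25 * 2204)                            # 55100, the reported non-zero count
print(tfidf_vectors(identical)[0])          # {}
print(sorted(tfidf_vectors(distinct)[0]))   # ['in', 'pineapple', 'spongebob']
```

If this is the cause, indexing with the loop variable (`sorted_links[x]`) should give the documents different vocabularies, and the TF-IDF vectors should no longer be empty.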
