python - Why does the tf-idf model in `gensim` throw away terms and counts after I transform a corpus?

Question

Why does the gensim tf-idf model throw away terms and counts after I transform a corpus?

My code:

from gensim import models

# A small corpus of 4 documents in bag-of-words form: (term_id, count) pairs.
doc0 = [(0, 1), (1, 1)]
doc1 = [(0, 1)]
doc2 = [(0, 1), (1, 1)]
doc3 = [(0, 3), (1, 1)]

corpus = [doc0, doc1, doc2, doc3]

# Train a tf-idf model on the corpus.
tfidf = models.TfidfModel(corpus)

# Printing the corpus at this point still shows the raw frequency counts.
for d in corpus:
    print(d)
print()

# To convert the corpus into tf-idf, apply the model to it;
# the result holds the weighted, normalized frequencies.
corpus = tfidf[corpus]

for d in corpus:
    print(d)

Output:

[(0, 1.0), (1, 1.0)]
[(0, 1.0)]
[(0, 1.0), (1, 1.0)]
[(0, 3.0), (1, 1.0)]

[(1, 1.0)]
[]
[(1, 1.0)]
[(1, 1.0)]

score 6 · Accepted Answer

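IDF is obtained by dividing the total number of documents by the number of documents that contain the term, and then taking the logarithm of that quotient. In your case, every document contains term0, so the IDF of term0 is log(1), which equals 0. As a result, the term0 column of your doc-term matrix is all zeros.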

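A term that appears in every document gets a weight of zero, because it carries no information for telling the documents apart.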
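
To make the arithmetic concrete, here is a minimal pure-Python sketch that recomputes the weights by hand for the corpus in the question. It does not call gensim. It assumes what I understand to be gensim's default behaviour: a base-2 logarithm for the IDF, L2 normalization of each document, and dropping of entries whose weight is effectively zero. The helper names (`document_frequencies`, `idf_weights`, `tfidf_transform`) and the `eps` threshold are made up for illustration.

```python
import math

# The corpus from the question: each document is a list of (term_id, count).
corpus = [
    [(0, 1), (1, 1)],
    [(0, 1)],
    [(0, 1), (1, 1)],
    [(0, 3), (1, 1)],
]


def document_frequencies(corpus):
    """Count, for each term id, how many documents contain it."""
    df = {}
    for doc in corpus:
        for term_id, _count in doc:
            df[term_id] = df.get(term_id, 0) + 1
    return df


def idf_weights(corpus):
    """idf(t) = log2(N / df(t)); the base-2 log is an assumed default."""
    n_docs = len(corpus)
    return {t: math.log2(n_docs / d)
            for t, d in document_frequencies(corpus).items()}


def tfidf_transform(doc, idf, eps=1e-12):
    """Weight counts by idf, drop near-zero weights, then L2-normalize."""
    weighted = [(t, count * idf[t]) for t, count in doc]
    kept = [(t, w) for t, w in weighted if abs(w) > eps]
    norm = math.sqrt(sum(w * w for _, w in kept))
    return [(t, w / norm) for t, w in kept] if norm else []


idf = idf_weights(corpus)
transformed = [tfidf_transform(doc, idf) for doc in corpus]

print(idf)
for doc in transformed:
    print(doc)
```

Term 0 occurs in all 4 documents, so its IDF is log2(4/4) = 0 and every (0, count) entry disappears, including the count of 3 in the last document. Term 1 occurs in 3 of the 4 documents, so its IDF is log2(4/3), roughly 0.415. Once term 0 is gone, term 1 is the only entry left in each document, and normalizing to unit length rescales it to 1.0, which is why the raw count no longer shows up in the output.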
