python - Difference between vocabulary_ and get_features() of TfidfVectorizer?

Question

I have

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Train the vectorizer
text="this is a simple example"
singleTFIDF = TfidfVectorizer(ngram_range=(1,2)).fit([text])
singleTFIDF.vocabulary_ # show the word-matrix position pairs

# Analyse the training string - text
single=singleTFIDF.transform([text])
single.toarray()

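I would like to associate each value in single with its corresponding feature. What structure does single have now? How can I map the position of a value in single to a feature?

How should I interpret the indices of vocabulary_ and get_features()? Are they related? According to the documentation, both contain the features together with indices, which is confusing.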

score 5 · Accepted Answer

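The vocabulary_ attribute is a dictionary in which every ngram is a key and the corresponding value is the column position of that ngram (feature) in the tfidf matrix. The get_feature_names_out() method (get_feature_names() in scikit-learn versions before 1.0) returns the ngrams ordered by the column position of each feature. You can therefore use either one to determine which tfidf column corresponds to which feature. In the example below, the output of the feature-name method is used to name the columns, which makes it easy to convert the tfidf matrix into a pandas DataFrame. Note also that in this example all values have the same weight, because each ngram occurs exactly once in the single training document, and that the squares of the weights sum to 1, because TfidfVectorizer applies L2 normalization to each row by default.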

singleTFIDF.vocabulary_
Out[41]: 
{'this': 5,
 'is': 1,
 'simple': 3,
 'example': 0,
 'this is': 6,
 'is simple': 2,
 'simple example': 4}

singleTFIDF.get_feature_names_out()
Out[42]: 
array(['example', 'is', 'is simple', 'simple', 'simple example', 'this',
       'this is'], dtype=object)

import pandas as pd
df = pd.DataFrame(single.toarray(), columns=singleTFIDF.get_feature_names_out())

df
Out[48]: 
    example        is  is simple    simple  simple example      this   this is
0  0.377964  0.377964   0.377964  0.377964        0.377964  0.377964  0.377964
