python - scikit-learn TfidfVectorizer ignoring certain words

Question

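I am trying to use TfidfVectorizer on sentences taken from the Wikipedia page on the history of Portugal. However, I noticed that the TfidfVec.fit_transform method is ignoring certain words. Here is the sentence I tried: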

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

sentence = "The oldest human fossil is the skull discovered in the Cave of Aroeira in Almonda."

TfidfVec = TfidfVectorizer()
tfidf = TfidfVec.fit_transform([sentence])

cols = [words[idx] for idx in tfidf.indices]
matrix = tfidf.todense()
pd.DataFrame(matrix, columns=cols, index=["Tf-Idf"])

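Output of the DataFrame:

Essentially, it is ignoring the words "Aroeira" and "Almonda".

I don't want it to ignore those words, so what should I do? I can't find anything about this in the documentation.

Another question: why is the word "the" repeated? Shouldn't the algorithm consider just one "the" and compute its tf-idf?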

score 5 · Accepted Answer

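tfidf.indices are just the indices of the feature names in the TfidfVectorizer. Using these indices to look up words in the sentence is a mistake.

You should get the column names for your df from TfidfVec.get_feature_names() (renamed to get_feature_names_out() in scikit-learn 1.0 and later).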
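A minimal sketch of that fix, reusing the sentence from the question. The variable names `vectorizer`, `cols` and `df` are my own, and the `hasattr` check is only there because the method name depends on the installed scikit-learn version. With the columns taken from the vectorizer's vocabulary, "aroeira" and "almonda" should show up and "the" should appear only once:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

sentence = "The oldest human fossil is the skull discovered in the Cave of Aroeira in Almonda."

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform([sentence])

# Column j of the matrix corresponds to feature name j of the vectorizer.
# get_feature_names_out() exists in scikit-learn >= 1.0; older releases
# only have get_feature_names().
if hasattr(vectorizer, "get_feature_names_out"):
    cols = list(vectorizer.get_feature_names_out())
else:
    cols = list(vectorizer.get_feature_names())

# toarray() gives a plain 2-D ndarray with one row per document.
df = pd.DataFrame(tfidf.toarray(), columns=cols, index=["Tf-Idf"])
print(df.T)
```

Note that the vectorizer lowercases by default, so the column names are "aroeira" and "almonda", and all occurrences of "The"/"the" are counted under a single "the" column.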

score 1 · Accepted Answer

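The output shows two "the" because you have two in the sentence. The whole sentence is encoded and you get a value for each index. The other two words do not appear because they are rare words. You can make them appear by lowering the threshold.

See min_df and max_features:
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html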
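For reference, a rough sketch of what those two parameters do to the vocabulary. The second document and the helper `vocabulary` are hypothetical, made up only for illustration. Keep in mind that the defaults are min_df=1 and max_features=None, so unless you set them yourself no term is dropped for being rare:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The oldest human fossil is the skull discovered in the Cave of Aroeira in Almonda.",
    "The skull is old.",  # hypothetical second document, for illustration only
]

def vocabulary(**kwargs):
    """Fit a TfidfVectorizer on docs and return the set of terms it kept."""
    vec = TfidfVectorizer(**kwargs)
    vec.fit(docs)
    return set(vec.vocabulary_)

# Defaults (min_df=1, max_features=None): every token is kept.
default_vocab = vocabulary()

# min_df=2: keep only terms that occur in at least two documents.
frequent_vocab = vocabulary(min_df=2)

# max_features=3: keep only the three most frequent terms in the corpus.
capped_vocab = vocabulary(max_features=3)

print(sorted(default_vocab))
print(sorted(frequent_vocab))
print(sorted(capped_vocab))
```

With the default settings "aroeira" and "almonda" are part of the vocabulary; they drop out only once min_df is raised or max_features is capped.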
