I am using nltk.stem.SnowballStemmer with sklearn.feature_extraction.text.TfidfVectorizer to improve efficiency, but I have run into a problem.
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk.stem

english_stemmer = nltk.stem.SnowballStemmer('english')

class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # Wrap the default analyzer so every token is stemmed before weighting
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(word) for word in analyzer(doc))

# The corpus: three short sentences about disk formatting
posts = ["How to format my disks", "hard disk formating at", "How to formated my disks"]

# Build the stemming TF-IDF vectorizer and transform the corpus
vectorizer_tfidf = StemmedTfidfVectorizer(min_df=1, stop_words="english")
x_tfidf = vectorizer_tfidf.fit_transform(posts)
print("feature_name:%s" % vectorizer_tfidf.get_feature_names())

# Report the matrix shape and the weights themselves
num_samples, num_features = x_tfidf.shape
print("samples_noroot: %d ,#features_noroot: %d" % (num_samples, num_features))
print(x_tfidf.toarray())
The output is:
feature_name:[u'disk', u'format', u'hard']
samples_noroot: 3 ,#features_noroot: 3
[[ 0.70710678 0.70710678 0. ]
[ 0.45329466 0.45329466 0.76749457]
[ 0.70710678 0.70710678 0. ]]
The word "disk" appears in every sentence, so its weight should be 0. How can I fix the code?
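For reference, the numbers above are consistent with scikit-learn's documented smoothed idf, idf(t) = ln((1 + n) / (1 + df(t))) + 1 (the default, smooth_idf=True), under which a term occurring in every document gets idf = 1, not 0; the zero weight you expect comes from the textbook definition ln(n / df(t)). Below is a minimal sketch comparing the two, assuming the counts from the corpus above (plain numpy, no scikit-learn calls):

import numpy as np

n, df = 3, 3  # 3 documents; the stem "disk" appears in all 3

# scikit-learn default (smooth_idf=True): the trailing +1 keeps idf at 1.0,
# so "disk" keeps a non-zero weight after L2 normalization
print(np.log((1.0 + n) / (1 + df)) + 1)  # 1.0

# Textbook definition: a term present in every document gets idf = ln(1) = 0
print(np.log(1.0 * n / df))              # 0.0

Note that passing smooth_idf=False only changes the formula to ln(n / df(t)) + 1, so the +1 remains; if the zero weight is required, one workaround (a sketch, not the only fix) is to take raw counts from an analogous stemming CountVectorizer subclass and apply ln(n / df) by hand.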