I have some questions about the TfidfVectorizer.

It is unclear to me how the words are selected. We can give a minimum support via min_df, but after that, what decides which features are selected (e.g. does higher support mean a higher chance of being kept)? If we say max_features=10000, do we always get the same features? And if we say max_features=12000, do we get the same 10000 features plus an extra 2000?
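Here is a small experiment I put together to test the nesting behaviour on a toy corpus (the corpus and the expectation that selection is by raw term frequency across the corpus are my own guesses):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "apple banana apple",
    "banana cherry apple",
    "cherry date banana apple",
]

# two vectorizers that differ only in max_features
v3 = TfidfVectorizer(max_features=3).fit(docs)
v5 = TfidfVectorizer(max_features=5).fit(docs)

# term frequencies here: apple=4, banana=3, cherry=2, date=1,
# so if selection is by corpus-wide term frequency, max_features=3
# should keep apple/banana/cherry
print(sorted(v3.vocabulary_))
print(sorted(v5.vocabulary_))
print(set(v3.vocabulary_) <= set(v5.vocabulary_))  # nested?
```

Is this nesting guaranteed in general, or only when there are no frequency ties at the cutoff?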
Also, is there a way to extend the, say, max_features=20000 features? I fit the vectorizer on some text, but I know of some words that should be included for sure, as well as some emoticons like ":-)". How can I add these to the TfidfVectorizer object, so that it is still possible to use the object to fit and predict?
to_include = [":-)", ":-P"]
method = TfidfVectorizer(max_features=20000, ngram_range=(1, 3),
                         # I know stop_words excludes words,
                         # but how do I force words to be included?
                         stop_words=test.stoplist[:100],
                         analyzer='word',
                         min_df=5)
method.fit(traindata)
Sought result:
X = method.transform(traindata)
X
<Nx20002 sparse matrix of type '<class 'numpy.int64'>'
with 1135520 stored elements in Compressed Sparse Row format>
where N is the sample size.
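For reference, one workaround I have been experimenting with (a sketch on a toy corpus; traindata, the two-pass refit, and the relaxed token_pattern are my own guesses, not something I found in the docs): fit once, take the learned vocabulary_, union it with the must-have tokens, and refit with the vocabulary parameter. Note that the must-have tokens have to be lowercased by hand, because lowercase=True is the default, and that the default token_pattern would drop the emoticons entirely.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

to_include = [":-)", ":-P"]

# toy stand-in for my real training data
traindata = ["I love this :-)", "worst movie ever",
             "not bad :-P", "love it"]

# first pass: learn the top-N vocabulary from the data alone
base = TfidfVectorizer(max_features=3)
base.fit(traindata)

# second pass: fix the vocabulary to the union of the learned words
# and the must-have tokens (lowercased to match lowercase=True);
# token_pattern is relaxed to any run of non-whitespace so the
# emoticons survive tokenization
vocab = sorted(set(base.vocabulary_) | {t.lower() for t in to_include})
method = TfidfVectorizer(vocabulary=vocab, token_pattern=r"\S+")

X = method.fit_transform(traindata)
print(X.shape)  # (len(traindata), len(vocab))
print(":-)" in method.vocabulary_)
```

Is this double-fit approach the intended way, or is there a way to do it in a single pass?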