
I have some questions about the TfidfVectorizer.

It is unclear to me how the words are selected. We can give a minimum support, but after that, what will decide which features will be selected (e.g. higher support more chance)? If we say max_features = 10000, do we always get the same? If we say max_features = 12000, will we get the same 10000 features, but an extra added 2000?

Also, is there a way to extend the, say, max_features=20000 features? I fit it on some text, but I know of some words that should be included for sure, and also some emoticons like ":-)" etc. How can I add these to the TfidfVectorizer object, so that it will be possible to use the object to fit and predict?

to_include = [":-)", ":-P"]
method = TfidfVectorizer(max_features=20000, ngram_range=(1, 3),
                         # I know stopwords, but how about include words?
                         stop_words=test.stoplist[:100],
                         # include words ??
                         analyzer='word',
                         min_df=5)
method.fit(traindata)

Sought result:

X = method.transform(traindata)
X
<Nx20002 sparse matrix of type '<class 'numpy.float64'>'
 with 1135520 stored elements in Compressed Sparse Row format>
where N is the sample size

1 Answer


You are asking several separate questions. Let me answer them one by one:

"It is unclear to me how the words are selected."

From the documentation:

max_features : optional, None by default
    If not None, build a vocabulary that only consider the top
    max_features ordered by term frequency across the corpus.

All features (unigrams, bigrams and trigrams in your case) are ranked by their frequency across the whole corpus, and then the top 10000 are selected. The infrequent words are thrown away.

"If we say max_features = 10000, do we always get the same? If we say max_features = 12000, will we get the same 10000 features, but an extra added 2000?"

Yes. The process is deterministic: for a given corpus and a given max_features, you will always get the same features.
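The selection logic can be sketched in plain Python (a toy reimplementation for illustration, not sklearn's actual code): count every term across the corpus, rank by total frequency, and keep the top max_features. The sketch also shows why a larger max_features yields a superset of a smaller one.

```python
from collections import Counter

def top_n_features(corpus, max_features):
    """Toy sketch of max_features selection: rank terms by total
    frequency across the corpus and keep the top max_features."""
    counts = Counter()
    for doc in corpus:
        counts.update(doc.split())
    # most_common sorts by descending frequency; the sort is stable,
    # so ties keep their first-seen order, making the result deterministic
    return [term for term, _ in counts.most_common(max_features)]

corpus = ["a a a b b c", "a b d d", "c d e"]
top3 = top_n_features(corpus, 3)
top5 = top_n_features(corpus, 5)
# top3 is a prefix of top5: asking for more features only
# appends less frequent terms after the ones you already had
```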

"I fit it on some text, but I know of some words that should be included for sure, [...] How can I add these to the TfidfVectorizer object?"

You use the vocabulary parameter to specify which features should be used. For example, if you want only the emoticons to be extracted, you can do something like this:

emoticons = {":)":0, ":P":1, ":(":2}
vect = TfidfVectorizer(vocabulary=emoticons)
matrix = vect.fit_transform(traindata)

This will return an <Nx3 sparse matrix of type '<class 'numpy.float64'>' with M stored elements in Compressed Sparse Row format>. Notice that there are only 3 columns, one for each feature.

If you want the vocabulary to include the emoticons as well as the N most common features, you can calculate the most frequent features first, then merge them with the emoticons and re-vectorize, like so:

# calculate the most frequent features first
vect = TfidfVectorizer(max_features=10)
matrix = vect.fit_transform(traindata)
top_features = vect.vocabulary_
n = len(top_features)

# insert the emoticons into the vocabulary of common features
emoticons = {":)": 0, ":P": 1, ":(": 2}
for feature, index in emoticons.items():
    top_features[feature] = n + index

# re-vectorize using both sets of features
# at this point len(top_features) == 13
vect = TfidfVectorizer(vocabulary=top_features)
matrix = vect.fit_transform(traindata)
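The merge step above relies on the two vocabularies using disjoint index ranges: the vocabulary passed to TfidfVectorizer should map terms to indices with no repeats and no gaps between 0 and the largest index. A small helper (hypothetical, not part of sklearn) makes that invariant explicit and also guards against a term appearing in both sets:

```python
def merge_vocabularies(base, extra_terms):
    """Append extra_terms after the base vocabulary, assigning each
    new term the next free index so the merged mapping covers the
    contiguous range 0..len-1 with no repeats or gaps."""
    merged = dict(base)
    for term in extra_terms:
        if term not in merged:  # skip terms already in the base vocabulary
            merged[term] = len(merged)
    return merged

top = {"the": 0, "cat": 1, "sat": 2}
vocab = merge_vocabularies(top, [":)", ":P", ":("])
# vocab maps 6 terms to indices 0..5, with the base indices unchanged
```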
Answered 2013-11-03T14:49:26.077