python - How to prioritize certain features when using the max_features parameter in CountVectorizer

Question

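I have a working program, but I realized that some important n-grams in the test data are not among the 6500 max_features I allowed for the training data. Is it possible to add features such as "liar" or "pitiful" to the set of features that I train on with my training data?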

This is what I currently use to build the vectorizer:

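from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=6500)
X = vectorizer.fit_transform(train['text'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vectorizer.get_feature_names_out()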

score 0 · Accepted Answer

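This is hacky, and you probably can't count on it working in the future, but CountVectorizer relies mainly on the learned attribute vocabulary_, which is a dict with tokens as keys and "feature indices" as values. You can add to that dict and everything appears to work as expected; borrowing the example from the docs: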

from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
X2 = vectorizer2.fit_transform(corpus)
print(X2.toarray())

## Output:
# [[0 0 1 1 0 0 1 0 0 0 0 1 0]
#  [0 1 0 1 0 1 0 1 0 0 1 0 0]
#  [1 0 0 1 0 0 0 0 1 1 0 1 0]
#  [0 0 1 0 1 0 1 0 0 0 0 0 1]]

# Now we tweak:
vocab_len = len(vectorizer2.vocabulary_)
vectorizer2.vocabulary_['new token'] = vocab_len  # append to end
print(vectorizer2.transform(["And this document has a new token"]).toarray())

## Output
# [[1 0 0 0 0 0 0 0 0 0 1 0 0 1]]
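
A less hacky variant of the same idea, as a rough sketch rather than a definitive recipe: CountVectorizer also accepts a vocabulary argument at construction time, so you can learn the top features with max_features first and then build a second vectorizer from that list plus the tokens you want to force in. The names must_have, base, and full_vocab below are made up for illustration, and max_features=5 is only there to keep the toy example small. Note that forced tokens have to match what the analyzer produces (lowercased, and within ngram_range).

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
# Hypothetical tokens we want as features even if max_features would drop them.
must_have = ['liar', 'new token']

# Step 1: learn the top-N vocabulary the usual way.
base = CountVectorizer(ngram_range=(1, 2), max_features=5)
base.fit(corpus)

# Step 2: list the learned tokens in column order, then append any forced
# tokens that are not already present (duplicates would raise a ValueError).
learned = sorted(base.vocabulary_, key=base.vocabulary_.get)
extra = [tok for tok in must_have if tok not in base.vocabulary_]
full_vocab = learned + extra

# Step 3: build a vectorizer with the fixed vocabulary. max_features is
# ignored when a vocabulary is supplied, so it is left out here.
vectorizer = CountVectorizer(ngram_range=(1, 2), vocabulary=full_vocab)
X_train = vectorizer.fit_transform(corpus)
X_test = vectorizer.transform(['He is a liar and this is a new token'])

print(X_train.shape, X_test.shape)
```

Because the training and test matrices come from the same fixed vocabulary, they have the same number of columns. Keep in mind that a forced token which never occurs in the training text produces an all-zero training column, so a downstream model has no training signal for it.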
