This is hacky, and you probably shouldn't count on it working in the future, but CountVectorizer mainly relies on the learned attribute vocabulary_, which is a dict with tokens as keys and "feature indices" as values. You can add entries to that dict, and everything seems to work as intended; borrowing from the example in the docs:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
X2 = vectorizer2.fit_transform(corpus)
print(X2.toarray())
## Output:
# [[0 0 1 1 0 0 1 0 0 0 0 1 0]
# [0 1 0 1 0 1 0 1 0 0 1 0 0]
# [1 0 0 1 0 0 0 0 1 1 0 1 0]
# [0 0 1 0 1 0 1 0 0 0 0 0 1]]
# Now we tweak:
vocab_len = len(vectorizer2.vocabulary_)
vectorizer2.vocabulary_['new token'] = vocab_len # append to end
print(vectorizer2.transform(["And this document has a new token"]).toarray())
## Output:
# [[1 0 0 0 0 0 0 0 0 0 1 0 0 1]]