I'm training a spam detector using the MultinomialNB model in scikit-learn. I use the DictVectorizer class to transform tokens to word counts (i.e. features). I would like to be able to train the model over time using new data as it arrives (in this case in the form of chat messages incoming to our app server). For this, it looks like the partial_fit function will be useful.

However, what I can't figure out is how to enlarge the DictVectorizer after it has been initially "trained". If new features/words arrive that have never been seen, they are simply ignored. What I would like to do is pickle the current version of the model and the DictVectorizer and update them each time we do a new training session. Is this possible?

1 Answer

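In the documentation, the learning phase of DictVectorizer is done with a list of dictionaries. You can probably add the new features to the original data and call fit_transform again. That way your new values are added to the DictVectorizer.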

Be careful with the partial_fit method: it is a heavy operation. As the method's documentation states, it carries a processing overhead.
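To make the partial_fit behaviour concrete, here is a minimal sketch, assuming a DictVectorizer that is fitted once and then only used through transform. The batches, labels, and token names are made up for illustration. It shows the situation described in the question: a token that was not present at fit time ('prize') gets no column and is silently dropped.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical token-count dicts and labels (1 = spam, 0 = not spam).
first_batch = [{'free': 2, 'money': 1}, {'hello': 1, 'friend': 1}]
first_labels = [1, 0]
second_batch = [{'free': 1, 'prize': 1}, {'hello': 2}]
second_labels = [1, 0]

# Fit the vectorizer once; its columns are fixed from here on.
vectorizer = DictVectorizer()
X_first = vectorizer.fit_transform(first_batch)

model = MultinomialNB()
# The full list of classes must be passed on the first partial_fit call.
model.partial_fit(X_first, first_labels, classes=[0, 1])

# transform (not fit_transform) keeps the existing columns,
# so the unseen token 'prize' is ignored.
X_second = vectorizer.transform(second_batch)
model.partial_fit(X_second, second_labels)

n_features = X_second.shape[1]
prediction = model.predict(vectorizer.transform([{'free': 3}]))
```

Since each call has some overhead, it is usually better to feed partial_fit a few large batches than many tiny ones.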

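from sklearn.feature_extraction import DictVectorizer

v = DictVectorizer(sparse=False)
D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
X = v.fit_transform(D)  # learns the feature names and builds the matrix

# Train the model on X here.

# When new data arrives (values is a dictionary of token counts):
values = {'foo': 2, 'qux': 1}  # example record; 'qux' has not been seen before
D.append(values)
X = v.fit_transform(D)  # fit again so the new feature gets its own column

# X now has a different set of columns, so the model must be retrained on it.

# Two choices:
# - wait until several changes have accumulated before retraining, or
# - retrain every time the data changes (not very efficient).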
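
A rough sketch of the full cycle, assuming all past dictionaries and labels are kept around: refit the vectorizer, train a fresh model, and pickle the two together. The helper name retrain and the sample data are hypothetical. Note how the new feature 'apple' shifts every existing column, which is why a model trained on the old columns cannot simply be updated with partial_fit after a refit.

```python
import pickle

from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB


def retrain(all_dicts, all_labels):
    """Hypothetical helper: refit the vectorizer and train a fresh model."""
    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform(all_dicts)
    model = MultinomialNB().fit(X, all_labels)
    return vectorizer, model


# Illustrative history of token-count dicts and labels (1 = spam, 0 = not spam).
history = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
labels = [1, 0]
vectorizer, model = retrain(history, labels)
old_columns = list(vectorizer.feature_names_)

# A new message arrives with a token that was never seen before.
history.append({'foo': 1, 'apple': 4})
labels.append(1)
vectorizer, model = retrain(history, labels)
new_columns = list(vectorizer.feature_names_)

# Pickle the vectorizer and the model together so they always stay in sync.
blob = pickle.dumps((vectorizer, model))
restored_vectorizer, restored_model = pickle.loads(blob)
restored_prediction = restored_model.predict(
    restored_vectorizer.transform([{'apple': 2}])
)
```

Storing the pair in a single pickle avoids loading a model together with a vectorizer that has a different column layout.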
Answered 2015-04-15T09:40:21.590