python - Bringing a classifier to production

Question

I've saved my classifier pipeline using joblib:

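import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import Pipeline

vec = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))
pac_clf = PassiveAggressiveClassifier(C=1)
vec_clf = Pipeline([('vectorizer', vec), ('pac', pac_clf)])
vec_clf.fit(X_train, y_train)
joblib.dump(vec_clf, 'class.pkl', compress=9)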

Now I'm trying to use it in a production environment:

def classify(title):

  #load classifier and predict
  classifier = joblib.load('class.pkl')

  #vectorize/transform the new title then predict
  vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))
  X_test = vectorizer.transform(title)
  predict = classifier.predict(X_test)
  return predict

The error I'm getting is: ValueError: Vocabulary wasn't fitted or is empty! I guess I should load the vocabulary from the joblib file, but I can't get it to work.

score 9 · Accepted Answer

Just replace:

  #load classifier and predict
  classifier = joblib.load('class.pkl')

  #vectorize/transform the new title then predict
  vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))
  X_test = vectorizer.transform(title)
  predict = classifier.predict(X_test)
  return predict

with:

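  # load the saved pipeline, which includes both the vectorizer
  # and the classifier, and predict on the raw text directly
  classifier = joblib.load('class.pkl')
  # title must be an iterable of raw strings (use [title] for a single string)
  predict = classifier.predict(title)
  return predict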

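class.pkl contains the full pipeline, so there is no need to create a new vectorizer instance. As the error message says, you need to reuse the vectorizer that was trained originally, because the feature mapping from tokens (string n-grams) to column indices is stored in the vectorizer itself. This mapping is called the "vocabulary".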
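
To make this concrete, here is a minimal end-to-end sketch. The training data, the temporary file path, and the `path` parameter of `classify` are made up for illustration, and `random_state=0` is added only for repeatability. It assumes a scikit-learn version that still ships `PassiveAggressiveClassifier` and a standalone `joblib` install. It shows that the loaded pipeline accepts raw strings, that the fitted vocabulary comes back with the pickle, and that a brand-new vectorizer fails in the way the question describes.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data, only here to make the sketch runnable.
X_train = [
    "cheap flights to paris",
    "book a hotel in rome",
    "python pipeline raises error",
    "load classifier with joblib",
]
y_train = ["travel", "travel", "code", "code"]

# Same pipeline shape as in the question.
vec = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))
pac_clf = PassiveAggressiveClassifier(C=1, random_state=0)
vec_clf = Pipeline([("vectorizer", vec), ("pac", pac_clf)])
vec_clf.fit(X_train, y_train)

# Save to a temporary location (assumption: any writable path works).
model_path = os.path.join(tempfile.mkdtemp(), "class.pkl")
joblib.dump(vec_clf, model_path, compress=9)


def classify(title, path=model_path):
    """Load the saved pipeline and predict a label for one raw title string."""
    classifier = joblib.load(path)
    # The pipeline vectorizes internally, so pass raw text.
    # predict expects an iterable of documents, hence the list.
    return classifier.predict([title])[0]


label = classify("hotel in paris")

# The fitted vocabulary travels with the pickled pipeline.
loaded = joblib.load(model_path)
vocabulary = loaded.named_steps["vectorizer"].vocabulary_

# A brand-new vectorizer has no vocabulary, so transform fails.
try:
    TfidfVectorizer().transform(["hotel in paris"])
    fresh_failed = False
except ValueError:
    fresh_failed = True
```

The `vocabulary_` attribute is a dict from n-gram to column index, which is why the columns produced at prediction time line up with the ones the classifier saw during training.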
