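I built a binary classification model that predicts whether an article belongs to the positive or the negative class. I feed TF-IDF features, together with one additional feature, into an XGBoost classifier. On the train/test split and in cross-validation my AUC scores are very close to 1, but on my holdout data I get 0.5. That seemed odd, so I fed the same training data back into the model, and even that returned an AUC of 0.5. The code below takes a data frame, fits and transforms the text into TF-IDF vectors, and formats everything as a DMatrix.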

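import pickle

import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# `tfidf` is a TfidfVectorizer created elsewhere at module level.
def format_to_dmatrix(known_targets):
  y = known_targets['target']
  X = known_targets[['body', 'day_of_year']]
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1, random_state=42)

  # Fit on the training split only, then persist the learned vocabulary.
  tfidf.fit(X_train['body'])
  with open("tfidf_features.pkl", "wb") as f:
    pickle.dump(tfidf.vocabulary_, f)
  X_train_enc = tfidf.transform(X_train['body']).toarray()
  X_test_enc = tfidf.transform(X_test['body']).toarray()

  # get_feature_names() was removed in scikit-learn 1.2; newer versions
  # need list(tfidf.get_feature_names_out()) here.
  new_cols = tfidf.get_feature_names()
  new_cols.append('day_of_year')

  # Reshape day_of_year into a single column for each split.
  a = np.array(X_train['day_of_year']).reshape(-1, 1)
  b = np.array(X_test['day_of_year']).reshape(-1, 1)

  # Append day_of_year after the TF-IDF columns.
  X_train = np.append(X_train_enc, a, axis=1)
  X_test = np.append(X_test_enc, b, axis=1)

  dtrain = xgb.DMatrix(X_train, label=y_train.values, feature_names=new_cols)
  dtest = xgb.DMatrix(X_test, label=y_test.values, feature_names=new_cols)
  return dtrain, dtest, tfidf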

I cross-validated and got a test-auc-mean of 0.9979, so I saved the model, as shown below.

best_model = xgb.train(
  params,
  dtrain,
  num_boost_round=num_boost_round,
  evals=[(dtest, "Test")],
)

Here is my code for loading new data:

def test_newdata(data):
  tf1 = pickle.load(open("tfidf_features.pkl", 'rb'))
  tf1_new = TfidfVectorizer(max_features=1500, lowercase=True, analyzer='word',
                            stop_words='english', ngram_range=(1, 1),
                            vocabulary=tf1.keys())
  encoded_body = tf1_new.fit_transform(data['body']).toarray()
  new_cols = tf1_new.get_feature_names()
  new_cols.append('day_of_year')
  day_of_year = np.array(data['day_of_year'])
  day_of_year = day_of_year.reshape(day_of_year.shape[0], 1)
  formatted_test_data = np.append(encoded_body, day_of_year, axis=1)
  df = pd.DataFrame(formatted_test_data, columns=new_cols)
  return xgb.DMatrix(df)

The code below shows that, even though I load the same data, I get an AUC of 0.5. Is there a mistake somewhere that I'm missing?

loaded_model = xgb.Booster()
loaded_model.load_model("earn_modelv3.model")

holdout = known_targets
formatted_test_data = test_newdata(holdout)

holdout_preds = loaded_model.predict(formatted_test_data)

predictions_binary = np.where(holdout_preds > .5, 1, 0)
print(round(roc_auc_score(holdout['target'], predictions_binary), 4))
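
Another point that may be worth ruling out: the AUC above is computed on thresholded 0/1 predictions rather than on the raw scores returned by `predict`. The following is a minimal sketch, with made-up labels and probabilities, of how thresholding alone can change the reported AUC when every prediction lands on the same side of the cutoff.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 1, 1])
probabilities = np.array([0.6, 0.7, 0.8, 0.9])

# Thresholding at 0.5 maps every row to class 1, so the ranking is lost.
binary = np.where(probabilities > 0.5, 1, 0)

auc_from_scores = roc_auc_score(y_true, probabilities)
auc_from_binary = roc_auc_score(y_true, binary)
print(auc_from_scores, auc_from_binary)
```

In this toy case the raw scores rank the classes perfectly, while the thresholded predictions are constant and score exactly 0.5, so passing `holdout_preds` directly to `roc_auc_score` might give a more informative number.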