python - catboost: evaluation/test set with observation weights

Question

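I am working with a dataset containing a list of people (indexed by fiscal code). The target variable is binary (1: bought a book, 0: otherwise). All the predictors are categorical (e.g. nationality, city, street, income bin, etc.). A fiscal code can appear twice, and every instance/observation has a weight (1 if the code is not repeated, a value between 0 and 1 if it is).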

For example, the dataset looks like

Fiscal code | weight | target | categorical information

AAAAA1 | 0.98 | 0 | ...

AAAAA1 | 0.02 | 1 | ...

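I have two datasets with the same variables: one for training (X_train is the matrix of categorical variables, y_train is the target variable, train_weight is the weight of each observation in the training set) and one for testing (same variables and meaning: X_test, y_test and test_weight).

I tried a CatBoost model, CatBoostClassifier.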

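Initialize the booster and hyperparameters

import numpy as np
from catboost import CatBoostClassifier

# Column positions of the pandas 'category' columns
categorical_features_indices = np.where(X_train.dtypes == 'category')[0]
model = CatBoostClassifier(iterations=5000, learning_rate=0.1, depth=7, loss_function='Logloss', eval_metric='AUC')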

Fit the model

model.fit(X_train,
          y_train,
          eval_set=(X_test, y_test),
          cat_features=categorical_features_indices,
          use_best_model=True,
          verbose=True,
          sample_weight=train_weight)

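The question is: how can I take into account that the observations in the TEST set also have weights (test_weight)? Do you have any ideas?

I read the documentation at https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_catboostregressor_fit-docpage/, but I did not find anything useful there, unlike the lightgbm documentation (in case another boosting model is worth considering).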

score 0 · Accepted Answer

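My understanding is that this is a case where you need to use Pool, attaching the weights (and the categorical feature indices) to each Pool instead of passing them to fit, i.e.

from catboost import Pool

# Weights and categorical features belong to each Pool, not to fit()
train_pool = Pool(X_train, y_train, cat_features=categorical_features_indices, weight=train_weight)
eval_pool = Pool(X_test, y_test, cat_features=categorical_features_indices, weight=test_weight)

model.fit(train_pool,
          eval_set=eval_pool,
          use_best_model=True,
          verbose=True)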
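To see why the test weights matter, here is a minimal sketch of a weighted binary log loss in plain Python. It is only an illustration of the usual weighted-mean formula, not CatBoost's own implementation; the function name weighted_logloss and the two example rows (one fiscal code split 0.98 / 0.02, both predicted at 0.1) are made up for the example.

```python
import math


def weighted_logloss(y_true, p_pred, weights=None, eps=1e-15):
    """Weighted mean binary log loss; with weights=None every row counts as 1."""
    if weights is None:
        weights = [1.0] * len(y_true)
    total = 0.0
    for y, p, w in zip(y_true, p_pred, weights):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return -total / sum(weights)


# Hypothetical rows: the same fiscal code appears twice with weights 0.98 and 0.02,
# and the model predicts a low purchase probability for both rows.
y_test = [0, 1]
p_test = [0.1, 0.1]
test_weight = [0.98, 0.02]

unweighted = weighted_logloss(y_test, p_test)             # both rows count equally
weighted = weighted_logloss(y_test, p_test, test_weight)  # the 0.02 row barely counts
print(unweighted, weighted)
```

Without weights the unlikely "bought" row drags the score down as much as a full observation would; with weights it contributes only 2% of one, which is the behaviour you presumably want the evaluation set to reflect as well.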
