r - How to choose nrounds using `catboost`?

Question

If I understand catboost correctly, we need to tune nrounds using CV, just like in xgboost. I see the following code in In [8] of the official tutorial:

params_with_od <- list(iterations = 500,
                       loss_function = 'Logloss',
                       train_dir = 'train_dir',
                       od_type = 'Iter',
                       od_wait = 30)
model_with_od <- catboost.train(train_pool, test_pool, params_with_od)

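which results in the best iterations = 211.

My questions are:

Is it correct that this command uses test_pool to choose the best iterations instead of using cross-validation?
If yes, does catboost provide a command to choose the best iterations from CV, or do I need to do it manually?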

score 1 · Accepted Answer

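CatBoost is validating against test_pool to determine the optimal number of iterations (strictly speaking this is hold-out validation rather than k-fold cross-validation). Both train_pool and test_pool are datasets that include the target variable. Earlier in the tutorial they write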

train_path = '../R-package/inst/extdata/adult_train.1000'
test_path = '../R-package/inst/extdata/adult_test.1000'

column_description_vector = rep('numeric', 15)
cat_features <- c(3, 5, 7, 8, 9, 10, 11, 15)
for (i in cat_features)
    column_description_vector[i] <- 'factor'

train <- read.table(train_path, head=F, sep="\t", colClasses=column_description_vector)
test <- read.table(test_path, head=F, sep="\t", colClasses=column_description_vector)
target <- c(1)
train_pool <- catboost.from_data_frame(data=train[,-target], target=train[,target])
test_pool <- catboost.from_data_frame(data=test[,-target], target=test[,target])

When you execute catboost.train(train_pool, test_pool, params_with_od), train_pool is used for training, while test_pool is used as the validation set to determine the optimal number of iterations.
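To make the stopping rule concrete, here is a minimal sketch of an 'Iter'-style detector in plain R. It is not CatBoost's implementation: the function name iter_detector, the loss values, and the exact comparison against od_wait are assumptions made for illustration only.

```r
# Illustrative 'Iter'-style early stopping rule (an assumption, not CatBoost's code).
# val_loss: validation metric per iteration (lower is better).
# od_wait:  number of iterations to wait after the best one before stopping.
iter_detector <- function(val_loss, od_wait) {
  best_iter <- 1
  for (i in seq_along(val_loss)) {
    if (val_loss[i] < val_loss[best_iter]) {
      best_iter <- i                      # new optimum found
    } else if (i - best_iter >= od_wait) {
      break                               # no improvement for od_wait iterations
    }
  }
  list(best_iteration = best_iter, stopped_at = i)
}

# Synthetic loss curve: improves until iteration 5, then gets worse.
val_loss <- c(0.70, 0.60, 0.55, 0.52, 0.50, 0.51, 0.53, 0.54, 0.56, 0.58)
result <- iter_detector(val_loss, od_wait = 3)
```

The key point is that the "best" iteration is chosen by looking at the validation losses, so the data behind val_loss is no longer an unbiased test set afterwards.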

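Now you are right to be confused, since later in the tutorial they use test_pool again together with the fitted model to make a prediction (model_best is similar to model_with_od, but uses a different overfitting detector, IncToDec):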

prediction_best <- catboost.predict(model_best, test_pool, type = 'Probability')

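This is probably bad practice. They might get away with it with their IncToDec overfitting detector (I am not familiar with the mathematics behind it), but for the Iter type overfitting detector you need separate train, validation and test datasets (and if you want to be on the safe side, do the same for the IncToDec overfitting detector). However, this is only a tutorial showing the functionality, so I would not be too pedantic about which data they have already used and how.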
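A three-way split could be sketched roughly as follows in base R. The helper split_indices, the 60/20/20 proportions and the seed are hypothetical choices, not something the tutorial prescribes.

```r
# Hypothetical helper: split row indices 1..n into train / validation / test parts.
split_indices <- function(n, frac_train = 0.6, frac_valid = 0.2, seed = 42) {
  set.seed(seed)
  idx <- sample.int(n)                     # random permutation of 1..n
  n_train <- floor(frac_train * n)
  n_valid <- floor(frac_valid * n)
  n_test  <- n - n_train - n_valid         # remainder goes to the test part
  list(
    train = idx[seq_len(n_train)],
    valid = idx[n_train + seq_len(n_valid)],
    test  = idx[n_train + n_valid + seq_len(n_test)]
  )
}

parts <- split_indices(1000)
```

Each index set would then be turned into its own pool with the same constructor the tutorial uses (catboost.from_data_frame): the validation pool is passed to catboost.train as test_pool for the overfitting detector, and the test pool is kept back for the final catboost.predict call only.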

Here is a link with more details on the overfitting detector: https://tech.yandex.com/catboost/doc/dg/concepts/overfitting-detector-docpage/

score 1 · Accepted Answer

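Basing your number of iterations on one test_pool and the best iteration of catboost.train() is a very bad decision. Doing so, you are tuning your parameters to one specific test set, and your model may generalize poorly to new data. You are therefore correct in presuming that, like with XGBoost, you need to apply CV to find the optimal number of iterations.
There is indeed a CV function in catboost. What you should do is specify a large number of iterations and stop the training after a certain number of rounds without improvement, using the parameter early_stopping_rounds. Unfortunately, unlike LightGBM, catboost does not seem to have an option to automatically return the optimal number of boosting rounds after CV so that it can be applied in catboost.train(). Therefore, it requires a bit of a workaround. Here is an example that should work: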

    library(catboost)
    library(data.table)

    # Number of threads; n_cores was not defined in the original snippet
    n_cores = parallel::detectCores()

    parameter = list(
      thread_count = n_cores,
      loss_function = "RMSE",
      eval_metric = c("RMSE","MAE","R2"),
      iterations = 10^5, # Train up to 10^5 rounds
      early_stopping_rounds = 100 # Stop after 100 rounds of no improvement
    )

    # Apply 6-fold CV
    model = catboost.cv(
      pool = train_pool,
      fold_count = 6,
      params = parameter
    )

    # Transform output to DT
    setDT(model)
    # One row per boosting round, so the row index is the iteration number
    model[, iterations := .I]
    # Order from lowest to highest RMSE
    setorder(model, test.RMSE.mean)
    # Select iterations with lowest RMSE
    parameter$iterations = model[1, iterations]

    # Train model with optimal iterations
    model = catboost.train(
      learn_pool = train_pool,
      test_pool = test_pool,
      params = parameter
    )
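The selection step can also be written without data.table. The sketch below uses a small mock data frame in place of the catboost.cv output: the numbers are synthetic, the train.RMSE.mean column name is an assumption, and only test.RMSE.mean is taken from the example above.

```r
# Mock stand-in for the data frame returned by catboost.cv (synthetic numbers).
cv_result <- data.frame(
  test.RMSE.mean  = c(3.20, 2.70, 2.40, 2.30, 2.35, 2.50),
  train.RMSE.mean = c(3.00, 2.40, 2.00, 1.70, 1.50, 1.30)  # assumed column name
)

# Row i is assumed to correspond to boosting iteration i, so the row index of
# the smallest mean test RMSE is the iteration count to reuse for the final fit.
best_iterations <- which.min(cv_result$test.RMSE.mean)
best_rmse <- cv_result$test.RMSE.mean[best_iterations]
```

Unlike setorder, which.min leaves the table in its original order, which is convenient if you still want to plot the learning curve afterwards.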

score 0 · Accepted Answer

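I think this is a general question for xgboost and catboost. The choice of nround goes together with the choice of learning rate. Thus, I recommend a higher number of rounds (1000+) and a low learning rate. After finding the best hyperparameters, retry with a lower learning rate to check whether the hyperparameters you chose are stable.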

I also find @nikitxskv's answer misleading.

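In the R tutorial, In [12] just chooses learning_rate = 0.1 without trying multiple values. Thus, it gives no hint about tuning nround.
Actually, In [12] just uses the function expand.grid to find the best hyperparameters. It works for selecting depth, gamma and so on.
And in practice, we do not use this approach to find a proper nround (it takes too long).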
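For reference, a grid of the kind expand.grid produces might look like the sketch below. The candidate values are arbitrary assumptions and are not taken from the tutorial. The number of iterations is held fixed at a large value rather than searched over, in line with the advice above.

```r
# Every combination of the candidate values becomes one row of the grid.
grid <- expand.grid(
  depth = c(4, 6, 8),            # assumed candidate tree depths
  learning_rate = c(0.03, 0.1),  # assumed candidate learning rates
  iterations = 1000              # fixed, not tuned through the grid
)

n_candidates <- nrow(grid)
```

Each row would then be evaluated in turn, for example with CV, and the row with the best validation metric kept.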

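Now for the two questions.

Is it correct that this command uses test_pool to choose the best iterations instead of using cross-validation?

Yes, but you can use CV.

If yes, does catboost provide a command to choose the best iterations from CV, or do I need to do it manually?

It is up to you. If you are strongly averse to overfitting in boosting, I recommend you try it. There are a lot of packages that address this problem. I recommend the tidymodels packages.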
