statistics - Using the base_score parameter in R for XGBoost multiclass problems

Question

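I am trying to understand how xgboost handles multiclass problems. I used the IRIS dataset in R to predict which species an input belongs to from its features, and then tried to reproduce the results by hand.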

The code is as follows:

library(xgboost)  # xgb.plot.tree additionally needs the DiagrammeR package

# Encode the species as 0-based integer class labels, as multi:softprob expects.
test <- as.data.frame(iris)
test$y <- ifelse(test$Species=="setosa",0,
                 (ifelse(test$Species=="versicolor",1,
                         (ifelse(test$Species=="virginica",2,3)))))

x_iris <- test[,c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")]
y_iris <- test[,"y"]

# One boosting round with lambda = 0, so the tree can be checked by hand.
iris_model <- xgboost(data = data.matrix(x_iris), label = y_iris, eta = 0.1, base_score = 0.5, nrounds = 1,
                     subsample = 1, colsample_bytree = 1, num_class = 3, max_depth = 4, lambda = 0,
                     eval_metric = "mlogloss", objective = "multi:softprob")

xgb.plot.tree(model = iris_model, feature_names = colnames(x_iris))

I tried to compute the results by hand and compared the gain and cover values with the R output. I noticed a few things:

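Whatever value is passed to the "base_score" parameter in R, the initial probability is always 1/(number of classes). "base_score" is instead added at the end to the final log-odds values, and this matches the R output when the predict function is run to obtain the log-odds. In binary classification, by contrast, the "base_score" parameter is used as the model's initial probability.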

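# outputmargin = FALSE returns class probabilities; outputmargin = TRUE returns the raw margins (log-odds).
predict(iris_model, data.matrix(x_iris), reshape = TRUE, outputmargin = FALSE)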
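
One way to see why a constant added to every class margin might not affect the starting probabilities is that softmax is invariant to such a shift. The following is a minimal base R sketch, assuming that multi:softprob converts per-class margins to probabilities with an ordinary softmax. The softmax helper and the margin values are made up for illustration and are not taken from the fitted model.

```r
# Softmax over a vector of per-class margins (log-odds).
softmax <- function(margin) {
  z <- exp(margin - max(margin))  # subtract the max for numerical stability
  z / sum(z)
}

num_class <- 3

# Before any tree is built, every class starts from the same margin.
# Two different starting constants give the same probabilities: 1/3 each.
p_a <- softmax(rep(0.5, num_class))
p_b <- softmax(rep(10, num_class))
print(p_a)
print(p_b)

# Adding a constant to fitted margins leaves the probabilities unchanged too.
tree_margin <- c(0.15, -0.075, -0.075)  # hypothetical leaf values, illustration only
p_unshifted <- softmax(tree_margin)
p_shifted <- softmax(tree_margin + 0.5)
print(p_unshifted)
print(p_shifted)
```

Under this assumption, a constant offset only shows up in the raw margins, not in the probabilities, which would be consistent with the observation above.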

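For the multiclass problem, the second-order term of the loss (the Hessian) in the XGBoost source is (2.0f * p * (1.0f - p) * wt), whereas for the binary problem it is (p * (1.0f - p) * wt).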
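
The hand calculation for the first boosting round can be sketched in base R as follows. This assumes an initial probability of 1/num_class for every row, unit instance weights, the usual leaf weight formula -G / (H + lambda) scaled by eta, and that cover is the sum of Hessians in a node. The leaf holding exactly the 50 setosa rows is a hypothetical example, not read from the plotted tree.

```r
# First-round gradient statistics for the iris problem, computed by hand.
num_class <- 3
eta <- 0.1
lambda <- 0
wt <- 1                                  # unit instance weight (assumption)

y <- as.integer(iris$Species) - 1        # 0 = setosa, 1 = versicolor, 2 = virginica
p <- rep(1 / num_class, length(y))       # assumed initial probability for every row

# Tree for class 0 (setosa): gradient and Hessian per row.
grad <- (p - as.numeric(y == 0)) * wt
hess <- 2 * p * (1 - p) * wt             # multiclass Hessian with the factor 2

root_cover <- sum(hess)                  # cover = sum of Hessians in the node

# Hypothetical leaf holding exactly the 50 setosa rows.
in_leaf <- y == 0
leaf_value <- -eta * sum(grad[in_leaf]) / (sum(hess[in_leaf]) + lambda)

# Binary case for comparison: Hessian at p = 0.5, without the factor 2.
binary_hess <- 0.5 * (1 - 0.5) * wt

print(c(root_cover = root_cover, leaf_value = leaf_value, binary_hess = binary_hess))
```

Under these assumptions the root cover is 150 * 4/9, about 66.67, and the leaf value is 0.15; both can be compared with the numbers shown by xgb.plot.tree.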

The GitHub issue https://github.com/dmlc/xgboost/issues/638 explains the loss function, but it says nothing about why base_score is added at the end.

Is this because of how the algorithm is implemented in R, or is this how the XGBoost multiclass algorithm itself works?
