
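I recently discovered the folds parameter in xgb.cv, which lets you specify the indices of the validation set. The helper function xgb.cv.mknfold is then called inside xgb.cv, and it takes the remaining indices of each fold as the indices of that fold's training set.

Question: can I specify both the training and the validation indices through any of the interfaces xgboost provides?

My main motivation is time-series cross-validation, where I do not want the "non-validation" indices to be assigned automatically as training data. An example to illustrate what I want to do: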

# assume i have 100 strips of time-series data, where each strip is X_i
# validate only on 10 points after training
fold1:  train on X_1-X_10, validate on X_11-X_20
fold2:  train on X_1-X_20, validate on X_21-X_30
fold3:  train on X_1-X_30, validate on X_31-X_40
...
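
The index sets for the scheme above can be written down explicitly. Below is a minimal base-R sketch, assuming strip X_i corresponds to row i and the rows are in time order. The names `make_expanding_folds`, `n_strips` and `window` are hypothetical and not part of xgboost.

```r
# Hypothetical helper (not part of xgboost): build expanding-window folds.
# Assumes strip X_i is row i of the data and rows are in time order.
make_expanding_folds <- function(n_strips = 100L, window = 10L) {
  # last training row of each fold: window, 2 * window, ..., n_strips - window
  train_ends <- seq(from = window, to = n_strips - window, by = window)
  list(
    # fold k trains on rows 1 .. train_ends[k]
    train = lapply(train_ends, function(e) seq_len(e)),
    # fold k validates on the `window` rows that follow immediately
    valid = lapply(train_ends, function(e) seq(e + 1L, e + window))
  )
}

folds <- make_expanding_folds(100L, 10L)
range(folds$train[[3]])  # first and last training row of fold 3
range(folds$valid[[3]])  # first and last validation row of fold 3
```

The two lists are kept separate on purpose, so that the training rows of a fold never depend on which rows the other folds validate on.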

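Currently, using the folds parameter would force me to use the remaining examples as the validation set. This greatly increases the variance of the error estimate, because the remaining data vastly outnumber the training data and may be distributed very differently from it, especially for the earlier folds. Here is what I mean: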

fold1:  train on X_1-X_10, validate on X_11-X_100 # huge error
...

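I am open to solutions from other packages if they are convenient (that is, they do not require me to pry open the source code) and do not lose the efficiency of the original xgboost implementation.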


2 Answers


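I think the last part of the question has it the wrong way round and should say:

force me to use the remaining examples as the training set

It also seems that the helper function xgb.cv.mknfold mentioned in the question no longer exists. Note that my xgboost version is 0.71.2.

However, this looks fairly straightforward to achieve with a small modification of xgb.cv, for example: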

xgb.cv_new <- function(params = list(), data, nrounds, nfold, label = NULL, 
          missing = NA, prediction = FALSE, showsd = TRUE, metrics = list(), 
          obj = NULL, feval = NULL, stratified = TRUE, folds = NULL, folds_train = NULL, 
          verbose = TRUE, print_every_n = 1L, early_stopping_rounds = NULL, 
          maximize = NULL, callbacks = list(), ...) {
  check.deprecation(...)
  params <- check.booster.params(params, ...)
  for (m in metrics) params <- c(params, list(eval_metric = m))
  check.custom.obj()
  check.custom.eval()
  if ((inherits(data, "xgb.DMatrix") && is.null(getinfo(data, 
                                                        "label"))) || (!inherits(data, "xgb.DMatrix") && is.null(label))) 
    stop("Labels must be provided for CV either through xgb.DMatrix, or through 'label=' when 'data' is matrix")
  if (!is.null(folds)) {
    if (!is.list(folds) || length(folds) < 2) 
      stop("'folds' must be a list with 2 or more elements that are vectors of indices for each CV-fold")
    nfold <- length(folds)
  }
  else {
    if (nfold <= 1) 
      stop("'nfold' must be > 1")
    folds <- generate.cv.folds(nfold, nrow(data), stratified, 
                               label, params)
  }
  params <- c(params, list(silent = 1))
  print_every_n <- max(as.integer(print_every_n), 1L)
  if (!has.callbacks(callbacks, "cb.print.evaluation") && verbose) {
    callbacks <- add.cb(callbacks, cb.print.evaluation(print_every_n, 
                                                       showsd = showsd))
  }
  evaluation_log <- list()
  if (!has.callbacks(callbacks, "cb.evaluation.log")) {
    callbacks <- add.cb(callbacks, cb.evaluation.log())
  }
  stop_condition <- FALSE
  if (!is.null(early_stopping_rounds) && !has.callbacks(callbacks, 
                                                        "cb.early.stop")) {
    callbacks <- add.cb(callbacks, cb.early.stop(early_stopping_rounds, 
                                                 maximize = maximize, verbose = verbose))
  }
  if (prediction && !has.callbacks(callbacks, "cb.cv.predict")) {
    callbacks <- add.cb(callbacks, cb.cv.predict(save_models = FALSE))
  }
  cb <- categorize.callbacks(callbacks)
  dall <- xgb.get.DMatrix(data, label, missing)
  bst_folds <- lapply(seq_along(folds), function(k) {
    dtest <- slice(dall, folds[[k]])
    if (is.null(folds_train))
      dtrain <- slice(dall, unlist(folds[-k]))
    else
      dtrain <- slice(dall, folds_train[[k]])
    handle <- xgb.Booster.handle(params, list(dtrain, dtest))
    list(dtrain = dtrain, bst = handle, watchlist = list(train = dtrain, 
                                                         test = dtest), index = folds[[k]])
  })
  rm(dall)
  basket <- list()
  num_class <- max(as.numeric(NVL(params[["num_class"]], 1)), 
                   1)
  num_parallel_tree <- max(as.numeric(NVL(params[["num_parallel_tree"]], 
                                          1)), 1)
  begin_iteration <- 1
  end_iteration <- nrounds
  for (iteration in begin_iteration:end_iteration) {
    for (f in cb$pre_iter) f()
    msg <- lapply(bst_folds, function(fd) {
      xgb.iter.update(fd$bst, fd$dtrain, iteration - 1, 
                      obj)
      xgb.iter.eval(fd$bst, fd$watchlist, iteration - 1, 
                    feval)
    })
    msg <- simplify2array(msg)
    bst_evaluation <- rowMeans(msg)
    bst_evaluation_err <- sqrt(rowMeans(msg^2) - bst_evaluation^2)
    for (f in cb$post_iter) f()
    if (stop_condition) 
      break
  }
  for (f in cb$finalize) f(finalize = TRUE)
  ret <- list(call = match.call(), params = params, callbacks = callbacks, 
              evaluation_log = evaluation_log, niter = end_iteration, 
              nfeatures = ncol(data), folds = folds)
  ret <- c(ret, basket)
  class(ret) <- "xgb.cv.synchronous"
  invisible(ret)
}

I only added an optional argument folds_train = NULL, which is then used inside the function like this (see above):

if (is.null(folds_train))
  dtrain <- slice(dall, unlist(folds[-k]))
else
  dtrain <- slice(dall, folds_train[[k]])
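
To see what this branch changes, the selection can be replayed on plain index vectors without xgboost. The sketch below mirrors the if/else above with a hypothetical helper `train_idx_for_fold`. It is an illustration only, not the code path inside xgb.cv.

```r
# Hypothetical stand-in for the branch above, working on plain index vectors.
train_idx_for_fold <- function(k, folds, folds_train = NULL) {
  if (is.null(folds_train)) {
    unlist(folds[-k])     # default: all other validation folds become training rows
  } else {
    folds_train[[k]]      # override: caller supplies the training rows for fold k
  }
}

folds       <- list(11:20, 21:30, 31:40)   # validation rows per fold
folds_train <- list(1:10, 1:20, 1:30)      # training rows per fold

train_idx_for_fold(1, folds)               # rows 21..40, i.e. data after the validation block
train_idx_for_fold(1, folds, folds_train)  # rows 1..10, i.e. data before the validation block
```

With the default, fold 1 would be trained on rows that come later in time than its validation rows, which is exactly what the extra argument is meant to avoid.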

You can then use the new version of the function, for example like this:

# save original version
orig <- xgboost::xgb.cv

# devtools::install_github("miraisolutions/godmode")
godmode:::assignAnywhere("xgb.cv", xgb.cv_new)

# now you can use (call) xgb.cv with the additional argument

# once you are done, or may want to switch back to the original version
# (if you restart R you will also be back to the original version):
godmode:::assignAnywhere("xgb.cv", orig)

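You should now be able to call the function with the extra argument, supplying the additional indices for the training data.

Note that I have not had time to test this.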

Answered 2018-07-18T22:57:20.403

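According to the xgboost::xgb.cv documentation, you can pass custom test indices through the folds parameter (NULL by default). It must be passed as a list in which each element is a vector of indices.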
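
As a small illustration of that shape, the sketch below builds such a list by hand and applies a structural check modelled on the condition in the xgb.cv source quoted in the other answer (a list with at least two elements). The name `is_valid_folds` and the extra numeric check are assumptions made for this example.

```r
# Hypothetical check, modelled on `!is.list(folds) || length(folds) < 2`
# from the quoted xgb.cv source; the numeric test is an added assumption.
is_valid_folds <- function(folds) {
  is.list(folds) && length(folds) >= 2 &&
    all(vapply(folds, is.numeric, logical(1)))
}

my_folds <- list(51:60, 61:70, 71:80)  # three folds of row indices
is_valid_folds(my_folds)               # a list of index vectors passes
is_valid_folds(51:80)                  # a bare vector is not a list and fails
```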

For example, if you want a time-series style split, you could do something like this:

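create_test_idx <- function(size) {
  half_size <- round(size / 2)
  step <- round(0.1 * half_size)
  # fold start positions, spaced `step` rows apart over the second half of the data
  starts <- seq(from = half_size, to = size - step, by = step)
  # each fold is the full run of row indices from its start to the last row
  # (c(start, size) would yield only the two endpoints, not the range)
  lapply(starts, function(x) seq.int(as.integer(x), as.integer(size)))
}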

my_custom_idx <- create_test_idx(nrow(my_train_data))

Then (for example):

xgbcv <- xgboost::xgb.cv(
    params = params,
    data = my_train_data,  # same data the fold indices were built from
    nrounds = 10000,
    folds = my_custom_idx,
    showsd = TRUE,
    verbose = 0,
    early_stopping_rounds = 200,
    maximize = FALSE
  )
Answered 2021-02-10T14:12:12.637