r - How to include a blocking factor in makeClassifTask() from the mlr package?

Question

In some classification tasks with the mlr package, I need to work with a data.frame similar to this one:

set.seed(pi)
# Dummy data frame
df <- data.frame(
   # Repeated values ID
   ID = sort(sample(c(0:20), 100, replace = TRUE)),
   # Some variables
   X1 = runif(10, 1, 10),
   # Some Label
   Label = sample(c(0,1), 100, replace = TRUE)
   )
df

I need to cross-validate the model while keeping rows that share the same ID together. I know from the tutorial:

https://mlr-org.github.io/mlr-tutorial/release/html/task/index.html#further-settings

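We can include a blocking factor in the task. This indicates that some observations "belong together" and should not be separated when the data is split into training and test sets for resampling.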
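To make the idea of "belonging together" concrete, here is a minimal base R sketch of a blocked k-fold split, independent of mlr. It is an illustration only, not how mlr implements blocking; the variable names (fold_of_group, test_inds, and so on) and the choice of k = 5 are assumptions made for this example.

```r
set.seed(1)
# Toy grouping variable: 100 rows spread over repeated IDs, as in the data above
id <- sort(sample(0:20, 100, replace = TRUE))

k <- 5
groups <- unique(id)

# Assign each distinct ID (not each row) to one of k folds
fold_of_group <- sample(rep_len(seq_len(k), length(groups)))
names(fold_of_group) <- groups

# Every row inherits the fold of its ID
fold_of_row <- unname(fold_of_group[as.character(id)])

# Row indices of the test and training set for each fold
test_inds <- lapply(seq_len(k), function(f) which(fold_of_row == f))
train_inds <- lapply(seq_len(k), function(f) which(fold_of_row != f))

# IDs that occur on both sides of a split; each element should be empty
shared <- lapply(seq_len(k), function(f) {
  intersect(id[train_inds[[f]]], id[test_inds[[f]]])
})
```

Because folds are drawn over IDs rather than rows, fold sizes can be uneven when groups differ in size.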

The question is: how do I include this blocking factor in makeClassifTask?

Unfortunately, I could not find any example.

score 4 · Accepted Answer

Which version of mlr do you have? Blocking has been part of it for a while. You can pass it directly as an argument to makeClassifTask.
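To answer the version question, the installed version can be read with packageVersion() from the utils package. This small sketch guards against mlr not being installed; the variable name mlr_version is only illustrative.

```r
# Report the installed mlr version, or NA if mlr is not installed
mlr_version <- if (requireNamespace("mlr", quietly = TRUE)) {
  as.character(utils::packageVersion("mlr"))
} else {
  NA_character_
}
print(mlr_version)
```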

Here is an example with your data:

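library(mlr)

# The blocking factor is passed as a factor with one entry per row
df$ID = as.factor(df$ID)
# Keep ID out of the features; it is supplied separately as the blocking factor
df2 = df
df2$ID = NULL
df2$Label = as.factor(df$Label)
tsk = makeClassifTask(data = df2, target = "Label", blocking = df$ID)
res = resample("classif.rpart", tsk, resampling = cv10)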

# Verify that blocking worked: for each fold, list the IDs that
# appear in both the training set and the test set
lapply(1:10, function(i) {
  blocks.training = df$ID[res$pred$instance$train.inds[[i]]]
  blocks.testing = df$ID[res$pred$instance$test.inds[[i]]]
  intersect(blocks.testing, blocks.training)
})
# All entries are empty, so blocking indeed works.

score 1 · Accepted Answer

@jakob-r's answer no longer works. My guess is that something changed in cv10.

A minor edit to use "blocking.cv = TRUE" fixes it.

Complete working example:

set.seed(pi)
# Dummy data frame
df <- data.frame(
   # Repeated values ID
   ID = sort(sample(c(0:20), 100, replace = TRUE)),
   # Some variables
   X1 = runif(10, 1, 10),
   # Some Label
   Label = sample(c(0,1), 100, replace = TRUE)
   )
df 

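library(mlr)

df$ID = as.factor(df$ID)
# Keep ID out of the features; it is supplied separately as the blocking factor
df2 = df
df2$ID = NULL
df2$Label = as.factor(df$Label)
# blocking.cv = TRUE makes the 10-fold CV respect the task's blocking factor
resDesc <- makeResampleDesc("CV", iters = 10, blocking.cv = TRUE)
tsk = makeClassifTask(data = df2, target = "Label", blocking = df$ID)
res = resample("classif.rpart", tsk, resampling = resDesc)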

# Verify that blocking worked: each entry should be empty
lapply(1:10, function(i) {
  blocks.training = df$ID[res$pred$instance$train.inds[[i]]]
  blocks.testing = df$ID[res$pred$instance$test.inds[[i]]]
  intersect(blocks.testing, blocks.training)
})
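The check above prints a list that has to be inspected by eye. As a stricter variant, the same comparison can be wrapped in a small base R helper that returns a single TRUE or FALSE. The function name has_block_leak and the toy indices are hypothetical, chosen only for illustration; they are not part of mlr.

```r
# Hypothetical helper: TRUE if any block value occurs in both the training
# rows and the test rows of at least one fold
has_block_leak <- function(block, train_inds, test_inds) {
  block <- as.character(block)
  shared <- Map(function(tr, te) intersect(block[tr], block[te]),
                train_inds, test_inds)
  any(lengths(shared) > 0)
}

# Toy example: three blocks of two rows each
block <- factor(c("a", "a", "b", "b", "c", "c"))

# Splits that keep each block on one side
good_train <- list(c(3, 4, 5, 6), c(1, 2, 5, 6), c(1, 2, 3, 4))
good_test <- list(c(1, 2), c(3, 4), c(5, 6))

# A split that separates the two rows of block "a"
leaky_train <- list(c(2, 3, 4, 5, 6))
leaky_test <- list(1)

has_block_leak(block, good_train, good_test)   # FALSE
has_block_leak(block, leaky_train, leaky_test) # TRUE
```

With the resampling result from the example, the call would presumably be has_block_leak(df$ID, res$pred$instance$train.inds, res$pred$instance$test.inds), using the same fields as the lapply check.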
