我有类不平衡的数据(响应变量有两个类,其中一个类比另一个类明显更常见)。在这种情况下,准确率似乎不是训练模型的好指标(我可以获得 99% 的准确率并完全错误地分类少数类)。我认为使用 F1 分数会更有益。
有没有人尝试过使用 F1 分数作为 R 中的训练指标?我尝试修改 iris 数据集以将物种作为二元变量并运行随机森林。有人可以帮我调试吗?
library(caret)
library(randomForest)
data(iris)
iris$Species = ifelse(iris$Species == "setosa", "a", "b")
iris$Species = as.factor(iris$Species)
f1 <- function (data, lev = NULL, model = NULL) {
precision <- posPredValue(data$pred, data$obs, positive = "pass")
recall <- sensitivity(data$pred, data$obs, postive = "pass")
f1_val <- (2 * precision * recall) / (precision + recall)
names(f1_val) <- c("F1")
f1_val }
train.control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
classProbs = TRUE,
#sampling = "smote",
summaryFunction = f1,
search = "grid")
tune.grid <- expand.grid(.mtry = seq(from = 1, to = 10, by = 1))
random.forest.orig <- train(Species ~ ., data = iris,
method = "rf",
tuneGrid = tune.grid,
metric = "F1",
trControl = train.control)
给出以下错误:
Something is wrong; all the F1 metric values are missing:
F1
Min. : NA
1st Qu.: NA
Median : NA
Mean :NaN
3rd Qu.: NA
Max. : NA
NA's :10
Error: Stopping
In addition: There were 50 or more warnings (use warnings() to see the first 50)
5: stop("Stopping", call. = FALSE)
4: train.default(x, y, weights = w, ...)
3: train(x, y, weights = w, ...)
2: train.formula(Species ~ ., data = iris, method = "rf", tuneGrid = tune.grid,
metric = "F1", trControl = train.control)
1: train(Species ~ ., data = iris, method = "rf", tuneGrid = tune.grid,
metric = "F1", trControl = train.control)
> warnings()
Warning messages:
1: In randomForest.default(x, y, mtry = param$mtry, ...) :
invalid mtry: reset to within valid range
资料来源:使用 F1 度量的 Caret 训练模型