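I am using ranger to model a multiclass classification problem on a large mixed-type data frame (some of the categorical variables have more than 53 levels). Training and prediction run without any problems. However, I am stuck on producing the confusion matrix / contingency table.
I use the iris data to illustrate the difficulty, treating Species as the categorical response: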
library(ranger)
library(caret)
# Data
idx = sample(nrow(iris),100)
data = iris
# Split data sets
Train_Set = data[idx,]
Test_Set = data[-idx,]
# Train
Species.ranger <- ranger(Species ~ ., data = Train_Set, importance = "impurity", save.memory = TRUE, probability = TRUE)
# Test
probabilitiesSpecies <- predict(Species.ranger, data = Test_Set,type='response', verbose = TRUE)
or
probabilitiesSpecies <- as.data.frame(predict(Species.ranger, data = Test_Set,type='response', verbose = TRUE)$predictions)
This is where I run into trouble:
table(Test_Set$Species, probabilitiesSpecies$predictions)
Error in table(Test_Set$Species, probabilitiesSpecies$predictions) :
all arguments must have the same length
or
caret::confusionMatrix(Test_Set$Species, probabilitiesSpecies$predictions)
or
caret::confusionMatrix(table(Test_Set$Species, max.col(probabilitiesSpecies)-1))
gives
Error: `data` and `reference` should be factors with the same levels.
However, the binary classification shown below works:
idx = sample(nrow(iris),100)
data = iris
data$Species = factor(ifelse(data$Species=="virginica",1,0))
Train_Set = data[idx,]
Test_Set = data[-idx,]
# Train
Species.ranger <- ranger(Species ~ ., data = Train_Set, importance = "impurity", save.memory = TRUE, probability = TRUE)
# Test
probabilitiesSpecies <- as.data.frame(predict(Species.ranger, data = Test_Set,type='response', verbose = TRUE)$predictions)
caret::confusionMatrix(table(max.col(probabilitiesSpecies)-1, Test_Set$Species))
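How can I fix this for the multiclass case so that I get a confusion matrix? I have also posted this as a separate thread (Error when computing a multiclass confusion matrix with ranger).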
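
For reference, here is a minimal sketch of one possible approach, not a definitive fix. With probability=TRUE, predict() returns a matrix of class probabilities (one row per observation, one column per class), not a vector of labels. That matrix does not have the same length as Test_Set$Species, which appears to explain the table() error. Likewise, max.col(...)-1 yields the integers 0, 1, 2, which only match the factor levels in the binary example because the classes were recoded to "0" and "1". The sketch assumes the probability columns are named after the class levels; the fixed seed and the names prob, predicted and cm are introduced here for illustration only.

```r
library(ranger)

set.seed(42)  # assumption: fixed seed only so the sketch is reproducible
idx <- sample(nrow(iris), 100)
Train_Set <- iris[idx, ]
Test_Set  <- iris[-idx, ]

# probability = TRUE makes predict() return class probabilities, not labels
Species.ranger <- ranger(Species ~ ., data = Train_Set, probability = TRUE)
prob <- predict(Species.ranger, data = Test_Set)$predictions

# One row per test observation, one column per class
print(dim(prob))

# Pick the most probable class in each row and map the column index back to
# the class name, using the same factor levels as the reference
predicted <- factor(colnames(prob)[max.col(prob, ties.method = "first")],
                    levels = levels(Test_Set$Species))

# Plain contingency table: both arguments now have one entry per observation
cm <- table(Predicted = predicted, Actual = Test_Set$Species)
print(cm)

# caret accepts the two factors directly once their levels agree
if (requireNamespace("caret", quietly = TRUE)) {
  print(caret::confusionMatrix(data = predicted, reference = Test_Set$Species))
}
```

Looking up the label through colnames(prob) rather than subtracting 1 from the column index avoids depending on how the classes happen to be encoded or ordered. If only hard class labels are needed, fitting the forest without probability=TRUE should give a factor in $predictions that can be passed to table() directly.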