r - Levels in R - setting them correctly for a new data set

Question

I am using randomForest in R.

I am training on a data set that contains a factor variable. This variable has the following levels:

[1] "Economics"    "Engineering"   "Medicine"
[4] "Accounting"   "Biology"       "Computer Science"
[7] "Physics"      "Law"           "Chemistry"

My evaluation set contains a subset of these levels:

[1] "Law"          "Medicine"

The randomForest package requires the levels to be the same, so I tried:

levels(evaluationSet$course) <- levels(trainingSet$course)

But when I inspect rows of the evaluation set, the values have changed:

evaluationSet[1:3,c('course')]
# Gives "[1] Economics Engineering Economics", should give "[1] Law Medicine Law"

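I am new to R, but I think what is happening here is that a factor is an enumerated set. In the evaluation set, "Law" and "Medicine" are represented numerically in the factor (as 1 and 2 respectively). When I apply the new levels, it changes the values those indices map to.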

I found some similar topics on SO and tried their suggestions, but with no luck:

evaluationSet <- droplevels(evaluationSet)
levels(evaluationSet$course) <- levels(trainingSet$course)
evaluationSet$course <- factor(evaluationSet$course)

How can I set the levels to match the training set while preserving the data values?

Edit: output of head(evaluationSet) before and after running levels(evaluationSet$course) <- levels(trainingSet$course):

   timestamp score age takenBefore   course
1 1374910975  0.87  18           0      law
2 1374910975  0.81  21           0 medicine
3 1374910975  0.88  21           0      law
4 1374910975  0.88  21           0      law
5 1374910975  0.74  22           0      law
6 1374910975  0.76  23           1 medicine

   timestamp score age takenBefore      course
1 1374910975  0.87  18           0   economics
2 1374910975  0.81  21           0 engineering
3 1374910975  0.88  21           0   economics
4 1374910975  0.88  21           0   economics
5 1374910975  0.74  22           0   economics
6 1374910975  0.76  23           1 engineering

score 3 · Accepted Answer

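Your intuition is basically correct. The key point is that the order of the levels matters. They are not a set but a mapping from integer codes to labels.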

Here is an example:

> f <- factor(sample(letters[4:6],20,replace = TRUE))
> f
 [1] d e e d e e f d d f e e d d e e f e d d
Levels: d e f
> levels(f)
[1] "d" "e" "f"
> levels(f) <- letters[1:6]
> f
 [1] a b b a b b c a a c b b a a b b c b a a
Levels: a b c d e f

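Note that when we added levels, the labels of the "first" three levels were overwritten: every d became a, every e became b, and every f became c. By contrast,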

> f <- factor(sample(letters[4:6],20,replace = TRUE))
> f
 [1] d f f e e d d f d d f d d e e e e f d e
Levels: d e f
> levels(f) <- c(letters[4:6],letters[1:3])
> f
 [1] d f f e e d d f d d f d d e e e e f d e
Levels: d e f a b c

So you just need to respect the order of the levels that currently exist in the evaluation set.
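Applied to a course column, that idea might look like the sketch below. The vector trainLevels and the factor x are hypothetical stand-ins for levels(trainingSet$course) and evaluationSet$course, not data from the question. The existing levels stay first, in their current order, and only the missing ones are appended.

```r
# Hypothetical stand-ins for the training levels and the evaluation column
trainLevels <- c("economics", "engineering", "medicine", "law")
x <- factor(c("law", "medicine", "law"))

# Keep the current levels in place and append only the ones that are missing
levels(x) <- c(levels(x), setdiff(trainLevels, levels(x)))

as.character(x)  # the data values are unchanged
levels(x)        # current levels first, then the appended ones
```

The two factors then share the same set of levels, but not the same order. If the order has to match the training set as well, rebuilding with factor(as.character(x), levels = trainLevels) gives that.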

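One way to think about this is that a factor is really just an integer vector. Wherever R has stored a 1, that value corresponds to the first level. And since factor() sorts the levels alphabetically by default, you can easily scramble that mapping when you add levels.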
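To see the integer codes directly, here is a minimal sketch using made-up values (the vectors f and g are illustrative, not the asker's data). Assigning to levels() swaps the labels but leaves the stored codes alone, which is exactly what turned law into economics in the question.

```r
# Hypothetical example data, not taken from the question
f <- factor(c("law", "medicine", "law"))
levels(f)        # levels are sorted alphabetically by default
as.integer(f)    # the integer codes stored underneath

g <- f
levels(g) <- c("economics", "engineering", "medicine", "law")
as.integer(g)    # same codes as before
as.character(g)  # but code 1 now reads "economics" and code 2 "engineering"
```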

score 2 · Accepted Answer

If you set the levels explicitly inside factor(), you should have better luck:

eval = read.table(text="   timestamp score age takenBefore   course
1 1374910975  0.87  18           0      law
2 1374910975  0.81  21           0 medicine
3 1374910975  0.88  21           0      law
4 1374910975  0.88  21           0      law
5 1374910975  0.74  22           0      law
6 1374910975  0.76  23           1 medicine", header=TRUE)
eval$course = factor(eval$course, levels=c("economics", "engineering", "medicine", "law"))

Result:

> eval$course
[1] law      medicine law      law      law      medicine
Levels: economics engineering medicine law
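Rather than typing the levels by hand, you can probably pass the training set's levels straight in. This is a sketch under the assumption that both data frames have a course column; the two small data frames below are hypothetical stand-ins for the question's trainingSet and evaluationSet.

```r
# Hypothetical stand-ins for the question's trainingSet and evaluationSet
trainingSet <- data.frame(
  course = factor(c("economics", "engineering", "medicine", "law"))
)
evaluationSet <- data.frame(
  course = factor(c("law", "medicine", "law"))
)

# Rebuild the factor from its labels, reusing the training levels and their order
evaluationSet$course <- factor(as.character(evaluationSet$course),
                               levels = levels(trainingSet$course))

as.character(evaluationSet$course)  # values are preserved
identical(levels(evaluationSet$course), levels(trainingSet$course))

# An NA here would mean the evaluation data has a label the training set lacks
anyNA(evaluationSet$course)
```

Labels are matched exactly, including case, so "Law" and "law" count as different values and the mismatched one would come back as NA.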
