I have a dataset with an event rate below 3% (roughly 700 records in class 1 and 27,000 records in class 0).
ID         V1   V2     V3  V5     V6  Target
SDataID3   161  ONE    1   FOUR   0   0
SDataID4   11   TWO    2   THREE  2   1
SDataID5   32   TWO    2   FOUR   2   0
SDataID7   13   ONE    1   THREE  2   0
SDataID8   194  TWO    2   FOUR   0   0
SDataID10  63   THREE  3   FOUR   0   1
SDataID11  89   ONE    1   FOUR   0   0
SDataID13  78   TWO    2   FOUR   0   0
SDataID14  87   TWO    2   THREE  1   0
SDataID15  81   ONE    1   THREE  0   0
SDataID16  63   ONE    3   FOUR   0   0
SDataID17  198  ONE    3   THREE  0   0
SDataID18  9    TWO    3   THREE  0   0
SDataID19  196  ONE    2   THREE  2   0
SDataID20  189  TWO    2   ONE    1   0
SDataID21  116  THREE  3   TWO    0   0
SDataID24  104  ONE    1   FOUR   0   0
SDataID25  5    ONE    2   ONE    3   0
SDataID28  173  TWO    3   FOUR   0   0
SDataID29  5    ONE    3   ONE    3   0
SDataID31  87   ONE    3   FOUR   3   0
SDataID32  5    ONE    2   THREE  1   0
SDataID34  45   ONE    1   FOUR   0   0
SDataID35  19   TWO    2   THREE  0   0
SDataID37  133  TWO    2   FOUR   0   0
SDataID38  8    ONE    1   THREE  0   0
SDataID39  42   ONE    1   THREE  0   0
SDataID43  45   ONE    1   THREE  1   0
SDataID44  45   ONE    1   FOUR   0   0
SDataID45  176  ONE    1   FOUR   0   0
SDataID46  63   ONE    1   THREE  3   0
I am trying to find splits using a decision tree, but the resulting tree has only a root node.
> library(rpart)
> tree <- rpart(Target ~ ., data = subset(train, select = c(-Record.ID)), method = "class")
> printcp(tree)
Classification tree:
rpart(formula = Target ~ ., data = subset(train, select = c(-Record.ID)), method = "class")
Variables actually used in tree construction:
character(0)
Root node error: 749/18239 = 0.041066
n= 18239
  CP nsplit rel error xerror xstd
1  0      0         1      0    0
After reading most of the relevant posts on StackOverflow, I relaxed the control parameters, which gave me the decision tree I wanted.
> tree <- rpart(Target ~ ., data = subset(train, select = c(-Record.ID)), method = "class", control = rpart.control(minsplit = 1, minbucket = 2, cp = 0.00002))
> printcp(tree)
Classification tree:
rpart(formula = Target ~ ., data = subset(train, select = c(-Record.ID)),
method = "class", control = rpart.control(minsplit = 1, minbucket = 2,
cp = 2e-05))
Variables actually used in tree construction:
[1] V5 V2 V1
[4] V3 V6
Root node error: 749/18239 = 0.041066
n= 18239
          CP nsplit rel error xerror     xstd
1 0.00024275      0   1.00000 1.0000 0.035781
2 0.00019073     20   0.99466 1.0267 0.036235
3 0.00016689     34   0.99199 1.0307 0.036302
4 0.00014835     54   0.98798 1.0334 0.036347
5 0.00002000     63   0.98665 1.0427 0.036504
When I prune this tree, the result is a tree with a single node.
> pruned.tree <- prune(tree, cp = tree$cptable[which.min(tree$cptable[,"xerror"]),"CP"])
> printcp(pruned.tree)
Classification tree:
rpart(formula = Target ~ ., data = subset(train, select = c(-Record.ID)),
method = "class", control = rpart.control(minsplit = 1, minbucket = 2,
cp = 2e-05))
Variables actually used in tree construction:
character(0)
Root node error: 749/18239 = 0.041066
n= 18239
          CP nsplit rel error xerror     xstd
1 0.00024275      0         1      1 0.035781
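To see what the prune() call selects, the rule can be traced by hand on the CP table printed for the unpruned tree. The following is only a sketch: the matrix is retyped from that printout rather than read from the fitted object, and the names cptable, best_row and best_cp are illustrative.

```r
# CP table retyped from the printcp() output of the unpruned tree.
cptable <- cbind(
  CP          = c(0.00024275, 0.00019073, 0.00016689, 0.00014835, 0.00002000),
  nsplit      = c(0, 20, 34, 54, 63),
  "rel error" = c(1.00000, 0.99466, 0.99199, 0.98798, 0.98665),
  xerror      = c(1.0000, 1.0267, 1.0307, 1.0334, 1.0427),
  xstd        = c(0.035781, 0.036235, 0.036302, 0.036347, 0.036504)
)

# Same selection rule as in the prune() call: the row with the smallest xerror.
best_row <- which.min(cptable[, "xerror"])
best_cp  <- cptable[best_row, "CP"]

# Show the selected row, including how many splits survive at that CP.
print(cptable[best_row, ])
```

In this table the cross-validated error (xerror) only rises as splits are added, so the smallest xerror sits on the first row, where nsplit is 0. Pruning at that CP would therefore be expected to return the root node alone.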
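The tree should not consist of the root node alone, because mathematically we do obtain information gain at a given node (example provided below). I don't know whether I made a mistake in pruning, or whether rpart has a problem handling datasets with a low event rate.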
NODE              p      1-p    Entropy      Weights      Ent*Weight   # Obs
Node 1            0.032  0.968  0.204324671  0.351398601  0.071799404  10653
Node 2            0.05   0.95   0.286396957  0.648601399  0.185757467  19663
Sum(Ent*Weight)   0.257556871
Information gain  0.742443129
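For comparison, information gain is conventionally defined as the parent node's entropy minus the weighted entropy of its children, whereas the figure in the table equals 1 minus the weighted sum. The sketch below recomputes the gain under that conventional definition. It assumes the two nodes in the table are the children of a single split and that the parent is simply their union; the post does not state the parent's event rate, so p_parent is derived rather than given, and all variable names are illustrative.

```r
# Binary entropy in bits; p is the event rate of a node (0 < p < 1).
entropy <- function(p) {
  q <- 1 - p
  -(p * log2(p) + q * log2(q))
}

# Figures copied from the table above.
p_child <- c(0.032, 0.05)    # event rate in Node 1 and Node 2
n_child <- c(10653, 19663)   # observations in Node 1 and Node 2

w <- n_child / sum(n_child)                 # node weights
child_entropy <- sum(w * entropy(p_child))  # weighted child entropy

# Assumption: the parent node is the union of the two children,
# so its event rate is the weighted mean of the child rates.
p_parent  <- sum(w * p_child)
info_gain <- entropy(p_parent) - child_entropy

print(c(child_entropy  = child_entropy,
        parent_entropy = entropy(p_parent),
        info_gain      = info_gain))
```

Under these assumptions the weighted child entropy reproduces the Sum(Ent*Weight) row, while the gain relative to the parent comes out on the order of 0.001, far smaller than the value in the table.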