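I want to classify the demographic data of the MovieLens users table, but the J48 result is strange. When I classify my data with C5.0 everything is fine, but I have to study this algorithm (J48).
My data structure is as follows: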
$ user_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ age : Factor w/ 7 levels "1","18","25",..: 1 7 3 5 3 6 4 3 3 4 ...
$ occupation: Factor w/ 21 levels "0","1","2","3",..: 11 17 16 8 21 10 2 13 18 2 ...
$ gender : Factor w/ 2 levels "F","M": 1 2 2 2 2 1 2 2 2 1 ...
$ Class : Factor w/ 4 levels "1","2","3","4": 2 2 2 2 3 2 2 2 2 4 ...
The head of the data is:
head(data)
user_id age occupation gender Class
1 1 1 10 F 2
2 2 56 16 M 2
3 3 25 15 M 2
4 4 45 7 M 2
5 5 25 20 M 3
6 6 50 9 F 2
All columns except user_id are of nominal type and should be factors in R.
Classification code:
library(RWeka)
fit <- J48(data$Class~., data=data[,-c(1)], control = Weka_control(C=0.25))
currentUserClass = predict(fit,data[,-c(1)])
table(currentUserClass , data$Class)
The confusion table summarizing the result is:
currentUserClass 1 2 3 4
1 0 0 0 0
2 216 3630 1549 645
3 0 0 0 0
4 0 0 0 0
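To make the problem concrete: in the table above every user is assigned to class 2, so the accuracy is exactly what always guessing the largest class would give. A minimal base R sketch that checks this, with the counts copied by hand from the table (the object name conf is only for illustration):

```r
# Confusion table copied from the J48 output above
# (rows = predicted class, columns = actual class)
conf <- matrix(
  c(  0,    0,    0,   0,
    216, 3630, 1549, 645,
      0,    0,    0,   0,
      0,    0,    0,   0),
  nrow = 4, byrow = TRUE,
  dimnames = list(predicted = as.character(1:4), actual = as.character(1:4))
)

class_counts <- colSums(conf)                  # actual number of users per class
accuracy     <- sum(diag(conf)) / sum(conf)    # share of correct predictions
baseline     <- max(class_counts) / sum(conf)  # always predicting the largest class

round(c(accuracy = accuracy, baseline = baseline), 3)
```

Both numbers come out the same, which is consistent with the tree being a single leaf that predicts class 2.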
When I fit my model with C5.0, the result is as shown below, which is what I expected from both algorithms:
predictions 1 2 3 4
1 216 0 0 0
2 0 3630 0 0
3 0 0 1549 0
4 0 0 0 645
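More attempts:
- I changed the structure of the data and converted the factor columns into separate columns, but nothing changed.
- I changed the C control value; the result gets slightly better with C=0.75, but it is still completely wrong.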
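For reference, C is J48's pruning confidence factor, and Weka's J48 also accepts U (build an unpruned tree) and M (minimum number of instances per leaf). The sketch below only shows how such options can be passed through Weka_control. It runs on a small made-up data frame (toy, with randomly generated columns, is hypothetical and stands in for the real user table), so it says nothing about what the real data would produce.

```r
library(RWeka)

# Hypothetical toy data standing in for the real user table (random values)
set.seed(1)
n <- 200
toy <- data.frame(
  age    = factor(sample(c("1", "18", "25", "35"), n, replace = TRUE)),
  gender = factor(sample(c("F", "M"), n, replace = TRUE)),
  Class  = factor(sample(1:4, n, replace = TRUE,
                         prob = c(0.05, 0.60, 0.25, 0.10)))
)

# Default-style call: pruned tree with confidence factor 0.25
fit_pruned <- J48(Class ~ ., data = toy, control = Weka_control(C = 0.25))

# Unpruned tree that allows leaves containing a single instance
fit_unpruned <- J48(Class ~ ., data = toy,
                    control = Weka_control(U = TRUE, M = 1))

# Compare training-set predictions of the two trees
pred_pruned   <- predict(fit_pruned, toy)
pred_unpruned <- predict(fit_unpruned, toy)
table(predicted = pred_pruned,   actual = toy$Class)
table(predicted = pred_unpruned, actual = toy$Class)
```

Printing fit_pruned and fit_unpruned shows the trees themselves, which makes it easy to see whether pruning has collapsed everything into one leaf.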
Even after normalizing and changing the data, it did not work:
> head(data)
user_id age1 age18 age25 age35 age45 age50
1 1 5.1188737 -0.4726289 -0.7289391 -0.4960755 -0.3164894 -0.2990841
2 2 -0.1953231 -0.4726289 -0.7289391 -0.4960755 -0.3164894 -0.2990841
3 3 -0.1953231 -0.4726289 1.3716296 -0.4960755 -0.3164894 -0.2990841
4 4 -0.1953231 -0.4726289 -0.7289391 -0.4960755 3.1591400 -0.2990841
5 5 -0.1953231 -0.4726289 1.3716296 -0.4960755 -0.3164894 -0.2990841
6 6 -0.1953231 -0.4726289 -0.7289391 -0.4960755 -0.3164894 3.3429880
age56 occupation1 occupation2 occupation3 occupation4 occupation5
1 -0.2590882 -0.3094756 -0.2150398 -0.1717035 -0.3790765 -0.1374418
2 3.8590505 -0.3094756 -0.2150398 -0.1717035 -0.3790765 -0.1374418
3 -0.2590882 -0.3094756 -0.2150398 -0.1717035 -0.3790765 -0.1374418
4 -0.2590882 -0.3094756 -0.2150398 -0.1717035 -0.3790765 -0.1374418
5 -0.2590882 -0.3094756 -0.2150398 -0.1717035 -0.3790765 -0.1374418
6 -0.2590882 -0.3094756 -0.2150398 -0.1717035 -0.3790765 -0.1374418
occupation6 occupation7 occupation8 occupation9 occupation10 occupation11
1 -0.2016306 -0.3558574 -0.05312294 -0.1243576 5.4744311 -0.1477163
2 -0.2016306 -0.3558574 -0.05312294 -0.1243576 -0.1826371 -0.1477163
3 -0.2016306 -0.3558574 -0.05312294 -0.1243576 -0.1826371 -0.1477163
4 -0.2016306 2.8096490 -0.05312294 -0.1243576 -0.1826371 -0.1477163
5 -0.2016306 -0.3558574 -0.05312294 -0.1243576 -0.1826371 -0.1477163
6 -0.2016306 -0.3558574 -0.05312294 8.0399919 -0.1826371 -0.1477163
occupation12 occupation13 occupation14 occupation15 occupation16 occupation17
1 -0.2619865 -0.1551514 -0.2293967 -0.1562667 -0.2038431 -0.3010506
2 -0.2619865 -0.1551514 -0.2293967 -0.1562667 4.9049217 -0.3010506
3 -0.2619865 -0.1551514 -0.2293967 6.3982549 -0.2038431 -0.3010506
4 -0.2619865 -0.1551514 -0.2293967 -0.1562667 -0.2038431 -0.3010506
5 -0.2619865 -0.1551514 -0.2293967 -0.1562667 -0.2038431 -0.3010506
6 -0.2619865 -0.1551514 -0.2293967 -0.1562667 -0.2038431 -0.3010506
occupation18 occupation19 occupation20 genderM Class
1 -0.1082744 -0.1098287 -0.2208735 -1.5917949 2
2 -0.1082744 -0.1098287 -0.2208735 0.6281176 2
3 -0.1082744 -0.1098287 -0.2208735 0.6281176 2
4 -0.1082744 -0.1098287 -0.2208735 0.6281176 2
5 -0.1082744 -0.1098287 4.5267283 0.6281176 3
6 -0.1082744 -0.1098287 -0.2208735 -1.5917949 2
> fit <- J48(data$Class~., data=data, control = Weka_control(C=0.25))
> currentUserClass = predict(fit,data)
> table(currentUserClass , data$Class)
currentUserClass 1 2 3 4
1 7 1 2 2
2 201 3601 1470 617
3 8 28 75 14
4 0 0 2 12
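To show how little this last table improves on predicting class 2 for everyone, here is a minimal base R sketch that computes the overall accuracy and the per-class recall from it. The counts are copied by hand from the table above, and the object names are only for illustration.

```r
# Confusion table copied from the last J48 output above
# (rows = predicted class, columns = actual class)
conf <- matrix(
  c(  7,    1,    2,   2,
    201, 3601, 1470, 617,
      8,   28,   75,  14,
      0,    0,    2,  12),
  nrow = 4, byrow = TRUE,
  dimnames = list(predicted = as.character(1:4), actual = as.character(1:4))
)

accuracy <- sum(diag(conf)) / sum(conf)  # share of correct predictions
recall   <- diag(conf) / colSums(conf)   # correct predictions per actual class

round(accuracy, 3)
round(recall, 3)
```

Class 2 is recovered almost completely, while classes 1, 3 and 4 are still almost always predicted as class 2.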