我有一个 .csv 格式的数据集,如图所示:
NRC_CLASS,L1_MARKS_FINAL,L2_MARKS_FINAL,L3_MARKS_FINAL,S1_MARKS_FINAL,S2_MARKS_FINAL,S3_MARKS_FINAL,
FAIL,7,12,12,24,4,30,
PASS,49,36,46,51,31,56,
FAIL,59,35,42,18,18,45,
PASS,61,30,51,33,30,52,
PASS,68,30,35,53,45,54,
2,82,77,75,32,36,56,
FAIL,18,35,35,32,21,35,
2,86,56,46,44,37,60,
1,94,45,62,70,50,59,
第一栏谈到整体成绩的地方:
FAIL - Fail
PASS - Pass class
1 - First class
2 - Second class
D - Distinction
接下来是每个学生在 6 个科目中的分数。
无论如何,我可以找出哪个主题对整体结果有影响的表现吗?
我正在使用 Weka 并使用 J48 来构建一棵树。
J48分类器的总结是:
=== Summary ===
Correctly Classified Instances 30503 92.5371 %
Incorrectly Classified Instances 2460 7.4629 %
Kappa statistic 0.902
Mean absolute error 0.0332
Root mean squared error 0.1667
Relative absolute error 10.8867 %
Root relative squared error 42.7055 %
Total Number of Instances 32963
此外,我将标记数据离散化为 10 个 bin,并将 useEqualFrequency 设置为 true。现在J48的总结是:
=== Summary ===
Correctly Classified Instances 28457 86.3301 %
Incorrectly Classified Instances 4506 13.6699 %
Kappa statistic 0.8205
Mean absolute error 0.0742
Root mean squared error 0.2085
Relative absolute error 24.3328 %
Root relative squared error 53.4264 %
Total Number of Instances 32963