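I have been using Weka's J48 decision tree to classify keyword frequencies in RSS feeds into target categories, and I am having trouble reconciling the generated decision tree with the reported number of correctly classified instances and with the counts in the confusion matrix.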

For example, one of my .arff files contains the following data extracts:

@attribute Keyword_1_nasa_Frequency numeric
@attribute Keyword_2_fish_Frequency numeric
@attribute Keyword_3_kill_Frequency numeric
@attribute Keyword_4_show_Frequency numeric
...
@attribute Keyword_64_fear_Frequency numeric
@attribute RSSFeedCategoryDescription {BFE,FCL,F,M, NCA, SNT,S}

@data
0,0,0,34,0,0,0,0,0,40,0,0,0,0,0,0,0,0,0,0,24,0,0,0,0,13,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,10,0,0,0,0,0,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
...
20,0,64,19,0,162,0,0,36,72,179,24,24,47,24,40,0,48,0,0,0,97,24,0,48,205,143,62,78,
0,0,216,0,36,24,24,0,0,24,0,0,0,0,140,24,0,0,0,0,72,176,0,0,144,48,0,38,0,284,
221,72,0,72,0,SNT
...
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,S

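And so on: there are 64 keywords (columns) and 570 rows in total, where each row holds the frequencies of the keywords in one feed over one day. In this case there are 57 feeds over 10 days, giving 570 records to classify. Each keyword is prefixed with a surrogate number and suffixed with "Frequency".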

I am running the decision tree with its default parameters and 10-fold cross-validation.

Weka reports the following:

Correctly Classified Instances         210               36.8421 %
Incorrectly Classified Instances       360               63.1579 %

With the following confusion matrix:

=== Confusion Matrix ===

   a   b   c   d   e   f   g   <-- classified as
  11   0   0   0  39   0   0 |   a = BFE
   0   0   0   0  60   0   0 |   b = FCL
   1   0   5   0  72   0   2 |   c = F
   0   0   1   0  69   0   0 |   d = M
   3   0   0   0 153   0   4 |   e = NCA
   0   0   0   0  90  10   0 |   f = SNT
   0   0   0   0  19   0  31 |   g = S

The tree is as follows:

Keyword_22_health_Frequency <= 0
|   Keyword_7_open_Frequency <= 0
|   |   Keyword_52_libya_Frequency <= 0
|   |   |   Keyword_21_job_Frequency <= 0
|   |   |   |   Keyword_48_pic_Frequency <= 0
|   |   |   |   |   Keyword_63_world_Frequency <= 0
|   |   |   |   |   |   Keyword_26_day_Frequency <= 0: NCA (461.0/343.0)
|   |   |   |   |   |   Keyword_26_day_Frequency > 0: BFE (8.0/3.0)
|   |   |   |   |   Keyword_63_world_Frequency > 0
|   |   |   |   |   |   Keyword_31_gaddafi_Frequency <= 0: S (4.0/1.0)
|   |   |   |   |   |   Keyword_31_gaddafi_Frequency > 0: NCA (3.0)
|   |   |   |   Keyword_48_pic_Frequency > 0: F (7.0)
|   |   |   Keyword_21_job_Frequency > 0: BFE (10.0/1.0)
|   |   Keyword_52_libya_Frequency > 0: NCA (31.0)
|   Keyword_7_open_Frequency > 0
|   |   Keyword_31_gaddafi_Frequency <= 0: S (32.0/1.0)
|   |   Keyword_31_gaddafi_Frequency > 0: NCA (4.0)
Keyword_22_health_Frequency > 0: SNT (10.0)

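My question is about reconciling the matrix with the tree, or vice versa. As far as I understand the results, a rating like (461.0/343.0) indicates that 461 instances have been classified as NCA. But how can that be when the matrix shows only 153? I am not sure how to interpret this, so any help is welcome.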

Thanks in advance.

1 Answer

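The numbers in parentheses at each leaf should be read as (total number of instances given this classification at this leaf / number of those instances classified incorrectly at this leaf).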

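In your example of the first NCA leaf, it says that 461 instances were classified as NCA at that leaf, and that 343 of those 461 were classified incorrectly. So there are 461 - 343 = 118 correctly classified instances at that leaf.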
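To make that reading concrete, here is a minimal Python sketch that pulls the label and the two counts out of a leaf line as printed in your tree. The helper name `parse_leaf` and the regular expression are my own, written against the output format shown in your question; they are not part of Weka.

```python
import re

# Matches the trailing "Label (total/errors)" or "Label (total)" part of a
# printed J48 leaf line. The pattern is an assumption based on the tree output
# quoted in the question, not an official Weka grammar.
LEAF_RE = re.compile(r":\s*(\w+)\s*\((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)\s*$")


def parse_leaf(line):
    """Return (label, total, errors, correct) for a leaf line, or None for a split line."""
    match = LEAF_RE.search(line)
    if match is None:
        return None
    label = match.group(1)
    total = float(match.group(2))
    # A leaf printed as "(31.0)" has no error count, so it has zero errors.
    errors = float(match.group(3)) if match.group(3) else 0.0
    return label, total, errors, total - errors


print(parse_leaf("|   |   |   Keyword_26_day_Frequency <= 0: NCA (461.0/343.0)"))
# ('NCA', 461.0, 343.0, 118.0)
```

Lines that only describe a split, such as `Keyword_22_health_Frequency <= 0`, carry no counts and come back as `None`.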

Looking at your decision tree, note that NCA also appears at other leaves. Out of 461 + 3 + 31 + 4 = 499 total classifications as NCA, I count 118 + 3 + 31 + 4 = 156 correctly classified instances.
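The same tally can be sketched in a few lines of Python. The `leaves` list is copied by hand from the tree in your question as (label, total, errors), and `tally` is just an illustrative helper name.

```python
# (label, instances at the leaf, misclassified at the leaf), copied by hand
# from the printed tree; a leaf shown as "(31.0)" has 0 errors.
leaves = [
    ("NCA", 461, 343), ("BFE", 8, 3), ("S", 4, 1), ("NCA", 3, 0), ("F", 7, 0),
    ("BFE", 10, 1), ("NCA", 31, 0), ("S", 32, 1), ("NCA", 4, 0), ("SNT", 10, 0),
]


def tally(leaves, label):
    """Sum the total and the correct counts over every leaf that predicts `label`."""
    total = sum(t for name, t, _ in leaves if name == label)
    correct = sum(t - e for name, t, e in leaves if name == label)
    return total, correct


print(tally(leaves, "NCA"))            # (499, 156)
print(sum(t for _, t, _ in leaves))    # 570
```

The leaf totals add up to 570, which is every row in your data set.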

Your confusion matrix shows 153 correct classifications out of 39 + 60 + 72 + 69 + 153 + 90 + 19 = 502 total classifications as NCA.

So there is a slight difference between the tree (156/499) and your confusion matrix (153/502).
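If you want to check the matrix side of that comparison yourself, here is a small Python sketch with the matrix typed in from your question (rows are actual classes, columns are predicted classes). The variable names are mine.

```python
labels = ["BFE", "FCL", "F", "M", "NCA", "SNT", "S"]

# Confusion matrix copied from the question: matrix[actual][predicted].
matrix = [
    [11, 0, 0, 0, 39, 0, 0],
    [0, 0, 0, 0, 60, 0, 0],
    [1, 0, 5, 0, 72, 0, 2],
    [0, 0, 1, 0, 69, 0, 0],
    [3, 0, 0, 0, 153, 0, 4],
    [0, 0, 0, 0, 90, 10, 0],
    [0, 0, 0, 0, 19, 0, 31],
]

nca = labels.index("NCA")
predicted_nca = sum(row[nca] for row in matrix)   # column sum: everything predicted as NCA
correct_nca = matrix[nca][nca]                    # diagonal cell: actual NCA predicted as NCA

correct_all = sum(matrix[i][i] for i in range(len(labels)))
total = sum(sum(row) for row in matrix)

print(predicted_nca, correct_nca)                 # 502 153
print(correct_all, total)                         # 210 570
print(round(100 * correct_all / total, 4))        # 36.8421
```

The diagonal sum and the percentage match the "Correctly Classified Instances" line Weka printed, so that matrix and that summary belong together.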

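Note that if you run Weka from the command line, it outputs a tree and a confusion matrix both for testing on all the training data and for testing with cross-validation. Make sure you are looking at the right matrix for the right tree. Because Weka outputs both training and test results, you end up with two pairs of matrix and tree, and you may have mixed them up.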

answered 2012-08-08T18:27:10.763