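My question concerns the results of cross-validation (CV) on a data set, which can help me identify the classes responsible for most of the error. For example, consider the following CV output: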

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 0.194     0.015      0.315     0.194     0.24       0.786    A
                 0.369     0.024      0.571     0.369     0.449      0.844    B
                 0.096     0.015      0.167     0.096     0.122      0.688    C
                 0.478     0.015      0.558     0.478     0.515      0.858    D
                 0.648     0.01       0.768     0.648     0.703      0.904    E
                 0.481     0.019      0.82      0.481     0.606      0.928    F
                 0.358     0.012      0.646     0.358     0.461      0.862    G
                 1         0.001      0.973     1         0.986      1        H
                 0.635     0.005      0.825     0.635     0.717      0.959    I
                 0.176     0.003      0.667     0.176     0.278      0.923    J
                 0.999     0.346      0.717     0.999     0.835      0.984    K
Weighted Avg.    0.704     0.169      0.692     0.704     0.671      0.931

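From this example, it is evident that class K drags down the combined accuracy (note its FP rate, which matters in my context). My question is: would it be wise to drop class K from the training set altogether? Or would it be better to accept test-instance classifications only for the more accurate classes (in this example, any class other than K)?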

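My argument against dropping an entire class (e.g. K) is that a test instance that actually belongs to class K would then be forced into some other class, which seems illogical.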

Any input?

Thanks

2 Answers

This really depends on the problem you are tackling. For example, do the classes reflect an objective ground truth (e.g. attributing a text to its author), or are they arbitrary (e.g. classifying "round" vs. "non-round" objects)? What are the relative costs of type I vs. type II errors, and how important is recall (coverage)?

However, a practical method I can suggest is hierarchical classification.

Specifically: using the CV confusion matrix, find pairs (or groups) of classes that are not cleanly separated, merge each group into a single class, and then train a secondary classifier that separates only the classes within that group. This may yield a more accurate classification, especially if it turns out that a different set of features or a different classification algorithm works better within a particular group.

For example, say your confusion matrix is:

       class/classified as
               |A |B |C |D 
              A|10|2 |1 |3
              B|0 |15|0 |1
              C|0 |0 |21|16
              D|0 |0 |9 |11

Clearly, there is a large amount of confusion between C and D. You could retrain the same classifier with just three classes, A, B and E (C and D combined), and then use a second classifier to separate C from D whenever E is predicted.
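As a rough illustration of the first step, here is a minimal Python sketch that picks the most-confused pair from the matrix above and merges it into one class. The helper names (`most_confused_pair`, `merge_classes`, `accuracy`) are made up for this example, and I am assuming rows are the true classes and columns the predicted ones.

```python
labels = ["A", "B", "C", "D"]
# Confusion matrix from the example above.
# Assumption: rows are true classes, columns are predicted classes.
cm = [
    [10, 2, 1, 3],
    [0, 15, 0, 1],
    [0, 0, 21, 16],
    [0, 0, 9, 11],
]

def most_confused_pair(cm):
    """Return indices (i, j), i < j, of the pair with the most mutual confusion."""
    n = len(cm)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return max(pairs, key=lambda p: cm[p[0]][p[1]] + cm[p[1]][p[0]])

def merge_classes(cm, i, j):
    """Fold class j into class i and drop row/column j."""
    keep = [k for k in range(len(cm)) if k != j]

    def group(k):
        # Original class indices represented by class k after the merge.
        return [i, j] if k == i else [k]

    return [
        [sum(cm[r][c] for r in group(a) for c in group(b)) for b in keep]
        for a in keep
    ]

def accuracy(cm):
    """Fraction of instances on the diagonal."""
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total

i, j = most_confused_pair(cm)
merged = merge_classes(cm, i, j)
print("merge:", labels[i], "+", labels[j])
print("accuracy before:", accuracy(cm), "after:", accuracy(merged))
```

With this matrix the pair is C and D, and top-level accuracy goes from 57/89 to 82/89. That figure only covers the first stage, though: the instances routed to E still have to be split by the secondary classifier, so the end-to-end gain depends on how well that one separates C from D.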

answered 2013-06-19T11:35:05.007

My first thought would be to try to find a way to assign a cost to false positives, in order to reduce this risk for class K.
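One way this could look, assuming your classifier gives you class probability estimates: instead of taking the most probable class, pick the class with the lowest expected cost under a cost matrix. The sketch below uses two classes only, and the cost values and the function name `min_expected_cost_class` are made up for illustration; you would have to choose costs that fit your problem.

```python
def min_expected_cost_class(probs, cost):
    """Pick the prediction with the lowest expected cost.

    probs[t]   : estimated probability that the true class is t
    cost[t][p] : cost of predicting p when the true class is t
    """
    n = len(probs)
    expected = [sum(probs[t] * cost[t][p] for t in range(n)) for p in range(n)]
    best = min(range(n), key=lambda p: expected[p])
    return best, expected

labels = ["J", "K"]
# Hypothetical costs: wrongly predicting K (a false positive for K)
# is penalised five times as much as wrongly predicting J.
cost = [
    [0.0, 5.0],  # true class J
    [1.0, 0.0],  # true class K
]
probs = [0.3, 0.7]  # the classifier leans towards K

choice, expected = min_expected_cost_class(probs, cost)
print("expected costs:", dict(zip(labels, expected)))
print("prediction:", labels[choice])
```

Here the plain argmax would say K, but the expected cost of predicting K (0.3 * 5 = 1.5) is higher than that of predicting J (0.7 * 1 = 0.7), so the rule picks J. This trades some recall on K for a lower FP rate on K.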

answered 2013-06-18T22:02:06.943