machine-learning - Machine learning: removing noisy classes (not individual instances)

Question

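My question concerns a dataset after cross-validation (CV), which helps me identify the classes that cause the most errors. For example, consider the following CV results: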

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 0.194     0.015      0.315     0.194     0.24       0.786    A
                 0.369     0.024      0.571     0.369     0.449      0.844    B
                 0.096     0.015      0.167     0.096     0.122      0.688    C
                 0.478     0.015      0.558     0.478     0.515      0.858    D
                 0.648     0.01       0.768     0.648     0.703      0.904    E
                 0.481     0.019      0.82      0.481     0.606      0.928    F
                 0.358     0.012      0.646     0.358     0.461      0.862    G
                 1         0.001      0.973     1         0.986      1        H
                 0.635     0.005      0.825     0.635     0.717      0.959    I
                 0.176     0.003      0.667     0.176     0.278      0.923    J
                 0.999     0.346      0.717     0.999     0.835      0.984    K
Weighted Avg.    0.704     0.169      0.692     0.704     0.671      0.931

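From this example, it is clear that class K drags down the overall precision (note its FP rate, which matters in my context). My question is: is it wise to drop class K from the training set entirely? Or would it be better to accept test-instance classifications only for the more accurate classes (in this example, any class except K)?

My argument against ignoring an entire class (e.g. K) is that a test instance that actually belongs to class K would then be forced into some other class, which seems illogical.

Any input?

Thanks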

score 2 · Accepted Answer

This really depends on the actual problem you are tackling. For example: do the classes reflect an objective ground truth (e.g. attributing a text to the writer who wrote it), or are the classes arbitrary (e.g. classifying "round" vs. "non-round" objects)? What are the relative weights of type-I vs. type-II errors, and how important is recall (coverage)?

However, a practical method I can suggest is hierarchical classification.

Specifically: using the CV confusion matrix, find pairs (or groups) of classes that are not neatly separated, merge each group into a single class, and then train a secondary classifier to separate only the classes belonging to that group. This might lead to more accurate classification, especially if it turns out that a different set of features or a different classification algorithm works better for separating a specific group.

For example, say your confusion matrix is:

       class/classified as
               |A |B |C |D 
              A|10|2 |1 |3
              B|0 |15|0 |1
              C|0 |0 |21|16
              D|0 |0 |9 |11

Clearly, there is a large amount of confusion between C and D. You could retrain the same classifier with just three classes, A, B and E (C and D combined), then separate C from D with a second classifier whenever E is predicted.
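
The procedure above could be sketched roughly as follows. This is only an illustration under some assumptions: `most_confused_pair` and `TwoStageClassifier` are hypothetical names, the two models are any objects exposing scikit-learn-style `fit(X, y)` and `predict(X)` methods, and "most confused" is taken to mean the largest sum of the two off-diagonal counts for a pair.

```python
from itertools import combinations


def most_confused_pair(matrix, labels):
    """Return the two labels with the largest mutual confusion.

    matrix[i][j] counts instances of true class labels[i]
    that were classified as labels[j].
    """
    def mutual(pair):
        i, j = pair
        return matrix[i][j] + matrix[j][i]

    i, j = max(combinations(range(len(labels)), 2), key=mutual)
    return labels[i], labels[j]


class TwoStageClassifier:
    """Coarse model over merged classes plus a specialist for the merged group."""

    def __init__(self, coarse, specialist, group, merged_label="E"):
        self.coarse = coarse          # any object with fit(X, y) / predict(X)
        self.specialist = specialist  # same interface, possibly another algorithm
        self.group = set(group)
        self.merged_label = merged_label

    def fit(self, X, y):
        # Stage 1: relabel every member of the group as the merged class.
        y_coarse = [self.merged_label if c in self.group else c for c in y]
        self.coarse.fit(X, y_coarse)
        # Stage 2: train only on instances that truly belong to the group.
        in_group = [(x, c) for x, c in zip(X, y) if c in self.group]
        self.specialist.fit([x for x, _ in in_group], [c for _, c in in_group])
        return self

    def predict(self, X):
        out = list(self.coarse.predict(X))
        # Hand every instance predicted as the merged class to the specialist.
        idx = [k for k, c in enumerate(out) if c == self.merged_label]
        if idx:
            fine = self.specialist.predict([X[k] for k in idx])
            for k, c in zip(idx, fine):
                out[k] = c
        return out


# The confusion matrix from the example above.
labels = ["A", "B", "C", "D"]
confusion = [
    [10, 2, 1, 3],
    [0, 15, 0, 1],
    [0, 0, 21, 16],
    [0, 0, 9, 11],
]
pair = most_confused_pair(confusion, labels)  # the C/D pair
```

With `pair` in hand, something like `TwoStageClassifier(coarse_model, specialist_model, group=pair)` could be fitted on the original training data. The specialist is free to use different features or a different algorithm than the coarse model.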

score 0 · Accepted Answer

My first thought would be to try to find a way of assigning a cost to false positives, in order to reduce this risk for class K.
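
One common way to do this, assuming the classifier outputs per-class probability estimates, is to predict the class with the lowest expected misclassification cost instead of the most probable one. The sketch below is illustrative only: `min_expected_cost_class`, the probabilities, and the cost value of 5 are made-up assumptions, not values from the question.

```python
def min_expected_cost_class(probs, cost):
    """Pick the prediction with the lowest expected misclassification cost.

    probs: dict mapping class -> estimated probability for one instance.
    cost:  dict mapping (true_class, predicted_class) -> cost. Missing pairs
           default to 0 for a correct prediction and 1 for any other error.
    """
    def expected_cost(pred):
        return sum(
            p * cost.get((true, pred), 0.0 if true == pred else 1.0)
            for true, p in probs.items()
        )

    return min(probs, key=expected_cost)


# Hypothetical numbers: the model leans towards K, but wrongly predicting K
# for a J instance is made five times as costly as any other mistake.
probs = {"J": 0.35, "K": 0.65}
cost = {("J", "K"): 5.0}

plain = max(probs, key=probs.get)               # most probable class, ignores costs
careful = min_expected_cost_class(probs, cost)  # cost-aware decision
```

Here the plain rule picks K, while the cost-aware rule picks J, because the expected cost of predicting K (0.35 × 5 = 1.75) exceeds that of predicting J (0.65 × 1 = 0.65). Raising the cost of false positives for K in this way trades some of K's recall for a lower FP rate.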
