matlab - Naive Bayes classifier and discriminant analysis accuracy is way off

Question

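I have two classification methods: discriminant analysis with the diaglinear option (naive Bayes) and the pure naive Bayes classifier implemented in MATLAB. The full dataset has 23 classes. The first method, discriminant analysis: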

%% Classify Clusters using Naive Bayes Classifier and classify
training_data = Testdata; 
target_class = TestDataLabels;

[class, err]  = classify(UnseenTestdata, training_data, target_class,'diaglinear')

cmat1 = confusionmat(UnseenTestDataLabels, class);
acc1 = 100*sum(diag(cmat1))./sum(cmat1(:));
fprintf('Classifier1:\naccuracy = %.2f%%\n', acc1);
fprintf('Confusion Matrix:\n'), disp(cmat1)

The confusion matrix gives 81.49% accuracy, with an error rate (err) of 0.5040 (I am not sure how to interpret that).
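One possible reading of err, offered as a sketch rather than a definitive answer: as far as I understand the classify documentation, the second output is an error rate estimated on the training data and weighted by the class priors, which are equal for all classes unless specified. If so, it averages per-class error rates and need not match 1 minus the accuracy on unseen data. The synthetic data and variable names below are assumptions for illustration only.

```matlab
% Sketch: compare classify's second output with hand-computed training errors.
% Synthetic, deliberately imbalanced data (an assumption, not the KDD data).
rng(0);                                          % reproducible random numbers
Xtrain = [randn(900,2); randn(100,2) + 1.5];     % 900 vs 100 observations
ytrain = [ones(900,1); 2*ones(100,1)];

% Second output: error-rate estimate based on the training data
[trainPred, err] = classify(Xtrain, Xtrain, ytrain, 'diaglinear');

% Error rate inside each class, measured on the training data
classes  = unique(ytrain);
classErr = zeros(numel(classes), 1);
for k = 1:numel(classes)
    inClass     = (ytrain == classes(k));
    classErr(k) = mean(trainPred(inClass) ~= classes(k));
end

plainErr    = mean(trainPred ~= ytrain);   % every observation counts equally
classAvgErr = mean(classErr);              % every class counts equally

fprintf('err = %.4f, plain = %.4f, class-averaged = %.4f\n', ...
        err, plainErr, classAvgErr);
```

With many small classes that are classified poorly, a class-averaged error can be far higher than the per-observation error, which would be consistent with seeing 0.5040 next to 81.49% accuracy.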

The second method, the naive Bayes classifier:

%% Classify Clusters using Naive Bayes Classifier
training_data = Testdata; 
target_class = TestDataLabels;
%# train model
nb = NaiveBayes.fit(training_data, target_class, 'Distribution', 'mn');

%# prediction
class1 = nb.predict(UnseenTestdata); 

%# performance
cmat1 = confusionmat(UnseenTestDataLabels, class1);
acc1 = 100*sum(diag(cmat1))./sum(cmat1(:));
fprintf('Classifier1:\naccuracy = %.2f%%\n', acc1);
fprintf('Confusion Matrix:\n'), disp(cmat1)

This yields 81.89% accuracy.

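I only ran one round of cross-validation. I am new to MATLAB and to supervised/unsupervised algorithms, so I did the cross-validation myself: I simply set aside 10% of the data for testing, drawn at random each time. I could run it several times and average the accuracy, but these results will do for the purpose of explanation.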
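For reference, repeating the 10% hold-out and averaging could look roughly like the sketch below. It uses cvpartition with the 'HoldOut' option on synthetic two-class data; the data, the number of repeats, and the variable names are assumptions, not part of my actual experiment.

```matlab
% Sketch: repeated random 90/10 hold-out with the diaglinear classifier.
rng(1);                                          % reproducible random numbers
X = [randn(200,2) + 2; randn(200,2) - 2];        % synthetic two-class data
y = [ones(200,1); 2*ones(200,1)];

nRepeats = 10;                                   % hypothetical number of repeats
acc = zeros(nRepeats, 1);
for r = 1:nRepeats
    cv       = cvpartition(y, 'HoldOut', 0.1);   % stratified 90/10 split
    trainIdx = training(cv);                     % logical index of training rows
    testIdx  = test(cv);                         % logical index of held-out rows
    pred     = classify(X(testIdx,:), X(trainIdx,:), y(trainIdx), 'diaglinear');
    acc(r)   = 100 * mean(pred == y(testIdx));
end

fprintf('mean accuracy = %.2f%% (std %.2f) over %d splits\n', ...
        mean(acc), std(acc), nRepeats);
```

Reporting the spread alongside the mean shows whether a gap such as 81.49% versus 81.89% is larger than the split-to-split noise.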

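So, on to my problem.

In my literature review of current methods, many researchers found that combining a single classification algorithm with a clustering algorithm gives better accuracy. They do this by finding the optimal number of clusters for their data and then running each partitioned cluster (whose members should be more similar to each other) through a classification algorithm. It is a process that combines the best parts of an unsupervised algorithm with a supervised classifier.

I am using a dataset that has been used many times in the literature, and I am trying an approach not too different from what others have done.

I first used simple K-means clustering, which clusters the data surprisingly well. The output looks like this: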

[Image: K-means clustering output]

Looking at the class labels in each cluster (K1, K2...K12):

%% output the class labels of each cluster
K1 = UnseenTestDataLabels(indX(clustIDX==1),:)

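I found that 9 of the clusters are dominated by a single class label, while 3 clusters contain multiple class labels. This suggests that K-means fits the data well.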
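A compact way to check this for all clusters at once, instead of inspecting K1 to K12 one by one, is a cluster-by-label count table. The sketch below uses crosstab on synthetic data; the data, the value of k, and the purity measure are assumptions for illustration.

```matlab
% Sketch: count class labels per cluster and measure how "pure" each cluster is.
rng(2);                                               % reproducible random numbers
X      = [randn(50,2) + 4; randn(50,2) - 4; randn(50,2)];   % three synthetic groups
labels = [ones(50,1); 2*ones(50,1); 3*ones(50,1)];

k        = 3;
clustIDX = kmeans(X, k, 'Replicates', 5);             % best of 5 random starts

% Rows are clusters, columns are class labels, entries are counts
counts = crosstab(clustIDX, labels);
disp(counts)

% Share of each cluster taken up by its most common label
purity = max(counts, [], 2) ./ sum(counts, 2);
disp(purity)
```

A purity close to 1 means the cluster is dominated by one label; low values flag the mixed clusters.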

However, the problem appears once I have the data for each cluster (cluster1, cluster2...cluster12):

%% output the real data of each cluster
cluster1 = UnseenTestdata(clustIDX==1,:)

I pass each cluster through naive Bayes or discriminant analysis, like this:

class1  = classify(cluster1, training_data, target_class, 'diaglinear');
class2  = classify(cluster2, training_data, target_class, 'diaglinear');
class3  = classify(cluster3, training_data, target_class, 'diaglinear');
class4  = classify(cluster4, training_data, target_class, 'diaglinear');
class5  = classify(cluster5, training_data, target_class, 'diaglinear');
class6  = classify(cluster6, training_data, target_class, 'diaglinear');
class7  = classify(cluster7, training_data, target_class, 'diaglinear');
class8  = classify(cluster8, training_data, target_class, 'diaglinear');
class9  = classify(cluster9, training_data, target_class, 'diaglinear');
class10  = classify(cluster10, training_data, target_class, 'diaglinear'); 
class11  = classify(cluster11, training_data, target_class, 'diaglinear');
class12  = classify(cluster12, training_data, target_class, 'diaglinear');

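The accuracy becomes terrible: 50% of the clusters are classified with 0% accuracy. Each classified cluster (acc1, acc2, ...acc12) has its own confusion matrix. You can see the accuracy of each cluster here: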

[Image: accuracy of each cluster]

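So my question is: where am I going wrong? My first thought was that the data and labels of the clusters were getting mixed up, but what I posted above looks correct and I cannot see anything wrong with it.

Why does exactly the same unseen 10% of data used in the first experiment produce such strange results once it is split into clusters? NB is a stable classifier that should not overfit easily, and given that the training data is vast while the clusters to be classified are concurrent, overfitting should not occur?

Edit:

As requested in the comments, I have included the cmat for the first test example, which gave 81.49% accuracy and an error of 0.5040: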

[Image: cmat for the first test example]

Also as requested, here is a snippet of K, class, and the associated cmat for this example (cluster4), which has 3.03% accuracy:

[Image: K, class, and cmat for cluster4]

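Since there is a large number of classes (23 in total), I decided to reduce the classes as outlined in the 1999 KDD Cup. This just applies a bit of domain knowledge, since some attacks are more similar than others and fall under one umbrella term.

I then trained the classifier on 444 thousand records while holding back 10% for testing.

Accuracy was worse at 73.39%; the error rate was also worse at 0.4261.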

[Image]

The unseen data breaks down into the following classes:

DoS: 39149
Probe: 405
R2L: 121
U2R: 6
normal.: 9721

The class labels assigned (the result of discriminant analysis):

DoS: 28135
Probe: 10776
R2L: 1102
U2R: 1140
normal.: 8249

The training data consists of:

DoS: 352452
Probe: 3717
R2L: 1006
U2R: 49
normal.: 87395

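I worry that if I cut down the training data so that it has a similar percentage of malicious activity, the classifier will not have enough predictive power to distinguish the classes. However, looking at other literature, I noticed that some researchers remove U2R because there is not enough data to classify it successfully.

The approaches I have tried so far are: one-class classifiers, where I train the classifier to predict only one class (ineffective); classifying individual clusters (even worse accuracy); reducing the class labels (second best); and keeping the full 23 class labels (best accuracy).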
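An alternative to resampling the training data is to change the class priors. As far as I understand the documentation, classify accepts a fifth prior argument, treats all classes as equally likely when it is omitted, and takes the priors from the training frequencies when given 'empirical'. The sketch below compares the two on synthetic imbalanced data; the data and names are assumptions, and whether this helps on the KDD data is untested.

```matlab
% Sketch: effect of the prior argument of classify on a rare class.
rng(3);                                          % reproducible random numbers
Xtrain = [randn(2000,2); randn(20,2) + 2];       % 2000 common vs 20 rare
ytrain = [ones(2000,1); 2*ones(20,1)];
Xtest  = [randn(500,2);  randn(5,2) + 2];
ytest  = [ones(500,1);   2*ones(5,1)];

% Default priors (no fifth argument)
predDefault   = classify(Xtest, Xtrain, ytrain, 'diaglinear');
% Priors estimated from the class frequencies in the training data
predEmpirical = classify(Xtest, Xtrain, ytrain, 'diaglinear', 'empirical');

fprintf('actual rare:                       %d\n', sum(ytest == 2));
fprintf('predicted rare (default priors):   %d\n', sum(predDefault == 2));
fprintf('predicted rare (empirical priors): %d\n', sum(predEmpirical == 2));
```

Empirical priors push borderline observations toward the common class, so fewer observations are labelled as rare. That is a trade-off: it may reduce over-prediction of classes such as U2R, but it can also miss more of the genuine rare cases.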

score 1 · Accepted Answer

As others have correctly pointed out, there is at least one problem here:

class1  = classify(cluster1, training_data, target_class, 'diaglinear');
...

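You are training the classifier on all of training_data, but evaluating it only on a sub-cluster. For clustering the data to have any effect, you need to train a different classifier within each sub-cluster. Sometimes this can be very difficult: for example, cluster C may contain very few (or no!) examples of class Y. This is inherent in trying to do joint clustering and learning.

The general framework for your problem is as follows: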

Training data:
   Cluster into C clusters
   Within each cluster, develop a classifier

Testing data:
   Assign observation into one of the C clusters (either "hard", or "soft")
   Run the correct classifier (corresponding to that cluster)

This call:

class1  = classify(cluster1, training_data, target_class, 'diaglinear');

does not do that.
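A minimal sketch of that framework, under stated assumptions: synthetic data, "hard" assignment of test points to the nearest centroid via pdist2, and a simple fallback when a cluster contains only one class. The variable names and the fallback rule are illustrative choices, not a definitive implementation.

```matlab
% Sketch: cluster the training data, train one classifier per cluster,
% then route each test observation to the classifier of its nearest centroid.
rng(4);                                               % reproducible random numbers
blob   = @(n, mu) bsxfun(@plus, randn(n,2), mu);      % n points around mu
Xtrain = [blob(200,[2 2]); blob(200,[-2 -2]); blob(200,[0 2]); blob(200,[0 -2])];
ytrain = [ones(400,1); zeros(400,1)];
Xtest  = [blob(50,[2 2]);  blob(50,[-2 -2]);  blob(50,[0 2]);  blob(50,[0 -2])];
ytest  = [ones(100,1); zeros(100,1)];

% Training data: cluster into C clusters and keep the centroids
C = 2;
[trainCluster, centroids] = kmeans(Xtrain, C, 'Replicates', 5);

% Testing data: hard-assign each observation to its nearest centroid
[~, testCluster] = min(pdist2(Xtest, centroids), [], 2);

% Within each cluster, train only on that cluster's training rows
pred = nan(size(ytest));
for c = 1:C
    inTrain = (trainCluster == c);
    inTest  = (testCluster == c);
    if ~any(inTest), continue; end                    % no test rows landed here
    labelsHere = unique(ytrain(inTrain));
    if numel(labelsHere) == 1
        pred(inTest) = labelsHere;                    % pure cluster: one label only
    else
        pred(inTest) = classify(Xtest(inTest,:), Xtrain(inTrain,:), ...
                                ytrain(inTrain), 'diaglinear');
    end
end
accClustered = 100 * mean(pred == ytest);

% Baseline: one classifier trained on all of the training data
predGlobal = classify(Xtest, Xtrain, ytrain, 'diaglinear');
accGlobal  = 100 * mean(predGlobal == ytest);

fprintf('per-cluster: %.2f%%, single classifier: %.2f%%\n', accClustered, accGlobal);
```

In this toy setup both classes have the same overall mean, so a single linear boundary has little to work with, while each cluster on its own is roughly linearly separable. Real data will not be this clean.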

score 1 · Accepted Answer

Here is a very simple example that shows exactly how it should work and what goes wrong:

%% Generate data and labels for each class
x11 = bsxfun(@plus,randn(100,2),[2 2]);
x10 = bsxfun(@plus,randn(100,2),[0 2]);

x21 = bsxfun(@plus,randn(100,2),[-2 -2]);
x20 = bsxfun(@plus,randn(100,2),[0 -2]);

%If you have the PRT (shameless plug), this looks nice:
%http://www.mathworks.com/matlabcentral/linkexchange/links/2947-pattern-recognition-toolbox
% ds = prtDataSetClass(cat(1,x11,x21,x10,x20),prtUtilY(200,200));

x = cat(1,x11,x21,x10,x20);
y = cat(1,ones(200,1),zeros(200,1));

clusterIdx = kmeans(x,2); %make 2 clusters
xCluster1 = x(clusterIdx == 1,:);
yCluster1 = y(clusterIdx == 1);
xCluster2 = x(clusterIdx == 2,:);
yCluster2 = y(clusterIdx == 2);


%Performance is terrible:
yOut1  = classify(xCluster1, x, y, 'diaglinear');
yOut2  = classify(xCluster2, x, y, 'diaglinear');

pcCluster = length(find(cat(1,yOut1,yOut2) == cat(1,yCluster1,yCluster2)))/size(y,1)

%Performance is Good:
yOutCluster1  = classify(xCluster1, xCluster1, yCluster1, 'diaglinear');
yOutCluster2  = classify(xCluster2, xCluster2, yCluster2, 'diaglinear');

pcWithinCluster = length(find(cat(1,yOutCluster1,yOutCluster2) == cat(1,yCluster1,yCluster2)))/size(y,1)

%Performance is Bad (using all data):
yOutFull  = classify(x, x, y, 'diaglinear');
pcFull = length(find(yOutFull == y))/size(y,1)

score -1 · Accepted Answer

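Looking at the cmat1 data for your first example (81.49% accuracy), the main reason you get high accuracy is that your classifier gets a large number of class 1 and class 4 instances right. Almost all other classes perform poorly (zero correct predictions). This is consistent with your last example (k-means first), where cluster7 has an acc7 of 56.9698.

Edit: It seems that in cmat1 there is no test data for more than half of the classes (look at the all-zero rows). So you can only tell that performance on classes 1 and 4 is good in general, and that you would get similar performance if you clustered first. For the other classes, this does not show that it works properly.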
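To see this directly, per-class recall can be read off the confusion matrix next to the overall accuracy. The matrix below is a small hypothetical example, not your cmat1; rows are true classes and columns are predicted classes, as returned by confusionmat.

```matlab
% Sketch: overall accuracy versus per-class recall from a confusion matrix.
% Hypothetical 3-class matrix (rows: true class, columns: predicted class).
cmat = [90  5  5;
         8  2  0;
         0  0  0];                 % class 3 has no test examples at all

overallAcc = 100 * sum(diag(cmat)) / sum(cmat(:));

support = sum(cmat, 2);            % number of test examples per true class
recall  = diag(cmat) ./ support;   % NaN where a class has no test examples

fprintf('overall accuracy = %.2f%%\n', overallAcc);
disp([support recall])
```

Here the overall accuracy is carried almost entirely by the first class, and the NaN marks a class about which the test set says nothing.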

score -1 · Accepted Answer

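After clustering the data, do you train a classifier for each cluster? If not, that may be your problem.

Try this. First, cluster the data and keep the centroids. Then, using the training data, train one classifier per cluster. At classification time, find the centroid nearest to the object you want to classify and use the corresponding classifier.

A single classifier is not a good idea here because it learns the patterns of the whole dataset. What you want when clustering is to learn the local patterns that describe each cluster.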

score -1 · Accepted Answer

Consider this function call:

classify(cluster1, training_data, target_class, 'diaglinear');

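training_data is a sample of the whole feature space. What does that mean? The classification model you are training will try to maximize classification accuracy over the whole feature space. This means that if you present test samples that behave like your training data, you will get reasonable classification results.

The point is that you are not presenting test samples that behave like your training data. In fact, cluster1 is a sample of only one partition of your feature space. More specifically, the instances in cluster1 are the samples of the feature space that lie closer to the centroid of cluster1 than to any other centroid, and this can degrade the classifier's performance.

So I suggest the following: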

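Cluster your training set and keep the centroids.
Using the training data, train one classifier per cluster. That is, train each classifier using only the instances that belong to that cluster.
At classification time, find the centroid nearest to the object you want to classify and use the corresponding classifier.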
