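I have implemented a Naive Bayes classifier for multiple classes, but my error rate stays the same even as I increase the size of the training set. I have been debugging it for a while and cannot figure out why this happens, so I am posting it here to see whether I am doing something wrong.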

%Naive Bayes Classifier
%This function splits the data 80:20 into training and test sets, then
%trains on increasing percentages (5, 10, 15, 20, 30) of the 80% training
%portion to see how the error rate on the test set changes.
%The goal is to compare against the plots in the Stanford paper
%http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf

function[tPercent] = naivebayes(file, iter, percent)
dm = load(file);
    for i=1:iter

        %Random permutation of the row indices, shared by data and labels
        idx = randperm(size(dm.data,1));

        %Using same idx for data and labels
        shuffledMatrix_data = dm.data(idx,:);
        shuffledMatrix_label = dm.labels(idx,:);

        percent_data_80 = round((0.8) * length(shuffledMatrix_data));


        %Doing 80-20 split
        train = shuffledMatrix_data(1:percent_data_80,:);

        test = shuffledMatrix_data(percent_data_80+1:length(shuffledMatrix_data),:);

        %Getting the label data from the 80:20 split
        train_labels = shuffledMatrix_label(1:percent_data_80,:);

        test_labels = shuffledMatrix_label(percent_data_80+1:length(shuffledMatrix_data),:);

        %One row per training percentage: [error rate, percent]
        percent_tracker = zeros(length(percent), 2);

        for pRows = 1:length(percent)

            percentOfRows = round((percent(pRows)/100) * length(train));
            new_train = train(1:percentOfRows,:);
            new_train_label = train_labels(1:percentOfRows);

            %get unique labels in training
            numClasses = size(unique(new_train_label),1);
            classMean = zeros(numClasses,size(new_train,2));
            classStd = zeros(numClasses, size(new_train,2));
            priorClass = zeros(numClasses, 1);

            % Doing the K class mean and std with prior
            for kclass=1:numClasses
                classMean(kclass,:) = mean(new_train(new_train_label == kclass,:));
                classStd(kclass, :) = std(new_train(new_train_label == kclass,:));
                priorClass(kclass, :) = length(new_train(new_train_label == kclass))/length(new_train);
            end

            error = 0;

            p = zeros(numClasses,1);

            % Calculating the posterior for each test row for each k class
            for testRow=1:length(test)
                c=0; k=0;
                for class=1:numClasses
                    temp_p = normpdf(test(testRow,:),classMean(class,:), classStd(class,:));
                    p(class, 1) = sum(log(temp_p)) + (log(priorClass(class)));
                end
                %Take the max of posterior 
                [c,k] = max(p(1,:));
                if test_labels(testRow) ~= k
                    error = error +  1;
                end
            end
            avgError = error/length(test);
            percent_tracker(pRows,:) = [avgError percent(pRows)];
            tPercent = percent_tracker;
            plot(percent_tracker)
        end
    end
end

These are the dimensions of my data:

x = 

      data: [768x8 double]
    labels: [768x1 double]

I am using the Pima dataset from UCI.

1 Answer

What result do you get when you run your implementation on the training data itself? Does it fit perfectly?
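
A minimal sketch of that check, assuming a tiny made-up data set rather than the Pima data: `X`, `y`, `mu`, `sigma`, `prior` and `scores` are hypothetical names, and the Gaussian log-density is computed by hand instead of with `normpdf`. The model is fitted and then scored on the same rows, so the training error should be close to zero for data this well separated.

```matlab
% Hypothetical toy data: two well-separated classes with labels 1 and 2.
X = [0.0 1.0; 0.2 0.8; -0.1 1.1; 0.1 0.9; ...
     5.0 6.0; 5.2 5.9;  4.9 6.1; 5.1 6.2];
y = [1; 1; 1; 1; 2; 2; 2; 2];

classes = unique(y);              % actual label values, not assumed to be 1..K
K = numel(classes);
[N, D] = size(X);

% Per-class mean, standard deviation and prior.
mu = zeros(K, D);
sigma = zeros(K, D);
prior = zeros(K, 1);
for k = 1:K
    rows = (y == classes(k));
    mu(k, :) = mean(X(rows, :), 1);
    sigma(k, :) = std(X(rows, :), 0, 1);
    prior(k) = sum(rows) / N;
end

% Log-posterior (up to a shared constant) for every row and every class.
scores = zeros(N, K);
for k = 1:K
    z = bsxfun(@rdivide, bsxfun(@minus, X, mu(k, :)), sigma(k, :));
    logpdf = bsxfun(@minus, -0.5 * z.^2, log(sigma(k, :))) - 0.5 * log(2 * pi);
    scores(:, k) = sum(logpdf, 2) + log(prior(k));
end

% Argmax across all K class scores of each row.
[~, best] = max(scores, [], 2);
predicted = classes(best);

trainError = mean(predicted ~= y);
fprintf('training error: %.3f\n', trainError);
```

When comparing this with the posted loop, look at the argmax step: there `p` is a numClasses-by-1 column, so `max(p(1,:))` inspects a single element and the returned index is always 1.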

It is hard to be sure, but a few things stand out:

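  1. It is important to have training data for every class. Without training examples for a class, you cannot really train the classifier to recognize it.
  2. If possible, the number of training examples should not be skewed toward some classes. For example, in a 2-class problem where class 1 makes up only 5% of the training and cross-validation examples, a function that always returns class 2 will have an error of just 5%. Have you tried checking precision and recall separately?
  3. You are fitting a normal distribution to each feature within each class and then using it for the posterior. I am not sure how well that behaves in terms of smoothing. Can you try re-implementing it with simple counts and see whether it gives different results?
  4. It is also possible that the features are highly redundant, so the Bayes approach overestimates the probabilities.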
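
Points 1 and 2 can be checked with a confusion matrix. The sketch below uses made-up label vectors (`yTrue` and `yPred` are hypothetical, not output from the posted code) to show how a classifier that always answers the majority class can still report a low error rate while recall for the minority class is zero. Reporting precision as 0 for a class that is never predicted is a convention assumed here.

```matlab
% Hypothetical labels: 8 examples of class 1, 2 examples of class 2.
yTrue = [1; 1; 1; 1; 1; 1; 1; 1; 2; 2];
% A degenerate classifier that always answers class 1.
yPred = ones(10, 1);

classes = unique(yTrue);
K = numel(classes);

% confusion(i, j) = number of rows with true class i predicted as class j.
confusion = zeros(K, K);
for i = 1:K
    for j = 1:K
        confusion(i, j) = sum(yTrue == classes(i) & yPred == classes(j));
    end
end

classCounts = sum(confusion, 2);          % examples per true class (point 1)
predCounts = sum(confusion, 1)';          % predictions per class

recall = diag(confusion) ./ classCounts;
% max(..., 1) avoids 0/0 for a class that is never predicted.
precision = diag(confusion) ./ max(predCounts, 1);
errorRate = mean(yPred ~= yTrue);

disp(confusion);
fprintf('error rate: %.2f\n', errorRate);
```

A column of zeros in `confusion` means that class is never predicted, which the overall error rate alone does not reveal.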
Answered 2012-10-07T09:45:47.137