
I'm having trouble understanding how k-NN classification works in MATLAB. Here's the problem: I have a large dataset (65 features for over 1500 subjects) and its respective class labels (0 or 1). From what has been explained to me, I have to divide the data into training, test and validation subsets to perform supervised training on the data and classify it via k-NN. First of all, what is the best ratio for dividing the three subsets (1/3 of the size of the dataset each)?

I've looked into the ClassificationKNN/fitcknn functions, as well as the crossval function (ideally to divide the data), but I'm really not sure how to use them.

To sum up, I want to:

- divide the data into 3 groups
- "train" the k-NN (I know it's not a method that requires training, but the equivalent of training) with the training subset
- classify the test subset and get its classification error/performance
- understand what the point of having a validation set is

I hope you can help me. Thank you in advance.

EDIT: I think I was able to do it, but, if that's not asking too much, could you see if I missed something? This is my code, for a random case:

nfeats = 60; ninds = 1000;
trainRatio = 0.8; valRatio = 0.1; testRatio = 0.1;
kmax = 100; % for instance...
data = randi(100, nfeats, ninds);   % random features, nfeats x ninds
class = randi(2, 1, ninds);         % random class labels (1 or 2)
[trainInd, valInd, testInd] = dividerand(ninds, trainRatio, valRatio, testRatio);
train = data(:, trainInd);
test  = data(:, testInd);
val   = data(:, valInd);
train_class = class(:, trainInd);
test_class  = class(:, testInd);
val_class   = class(:, valInd);
precisionmax = 0;
koptimal = 0;
for know = 1:kmax
    % is it the same thing to use knnclassify or fitcknn+predict??
    predicted_class = knnclassify(val', train', train_class', know);
    mdl = fitcknn(train', train_class', 'NumNeighbors', know);
    label = predict(mdl, val');
    consistency = sum(label == val_class') / length(val_class);
    if consistency > precisionmax
        precisionmax = consistency;
        koptimal = know;
    end
end
mdl_final = fitcknn(train', train_class', 'NumNeighbors', koptimal); % use the best k found above
label_final = predict(mdl_final, test');
consistency_final = sum(label_final == test_class') / length(test_class);

Thank you very much for all your help


1 Answer


For your first question, "what's the best ratio to divide the 3 subgroups", there are only rules of thumb:

  1. The amount of training data is the most important. The more, the better. So make it as large as possible, and definitely larger than the test or validation data.

  2. Test and validation data serve a similar purpose, so it is convenient to assign them the same amount of data. But it is important to have enough data to recognize over-adaptation, so they should be picked from the data base completely at random.

Consequently, 50/25/25 or 60/20/20 partitions are quite common. But if your total amount of data is small compared to the total number of weights of your chosen topology (e.g. 10 weights in your network and only 200 cases in the data), then 70/15/15 or even 80/10/10 might be the better choice.
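
For example, a 60/20/20 split of your 1500 subjects could be built with the dividerand call you are already using. This is only a minimal sketch; it assumes your data sits in a 65-by-1500 matrix data with a 1-by-1500 label vector class, as in your example code:

nSubjects = 1500;                                                    % 65 features x 1500 subjects assumed
[trainInd, valInd, testInd] = dividerand(nSubjects, 0.6, 0.2, 0.2);  % random 60/20/20 split

trainData = data(:, trainInd);  trainLabels = class(trainInd);
valData   = data(:, valInd);    valLabels   = class(valInd);
testData  = data(:, testInd);   testLabels  = class(testInd);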

Regarding your second question, "what's the point of having a validation set?":

Normally, you train your chosen model on your training data and then estimate its "success" by applying the trained model to unseen data, the validation set.

If you then stopped all further efforts to improve accuracy, you would indeed not need three partitions of your data. But typically you feel you could improve the model's success by, e.g., changing the number of weights or hidden layers, or ... and now a big loop starts running many iterations:

1) change weights and topology, 2) train, 3) validate, not satisfied, goto 1)

The long-term effect of this loop is that you adapt your model more and more to the validation data, so the results get better not because you improved your topology so cleverly but because you unconsciously learned the properties of the validation set and how to cope with them.

Now the final and only valid accuracy of your neural net is estimated on truly unseen data: the test set. This is done only once and is also useful for revealing over-adaptation. And you are not allowed to start a second, even bigger loop now, so that any adaptation to the test set is prohibited!
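
Applied to your k-NN setting, the "big loop" is the search over k: you pick the k with the best validation accuracy, and only afterwards touch the test set a single time. Here is a minimal sketch with fitcknn/predict, assuming the hold-out sets were created as in the split sketch above (trainData/valData/testData as 65-by-n matrices with matching 1-by-n label vectors):

bestK = 1; bestValAcc = 0;
for k = 1:100                                          % the loop: change k, train, validate
    mdl     = fitcknn(trainData', trainLabels', 'NumNeighbors', k);
    valPred = predict(mdl, valData');
    valAcc  = mean(valPred == valLabels');
    if valAcc > bestValAcc
        bestValAcc = valAcc;
        bestK = k;
    end
end

% only once, after the loop has finished: estimate the final accuracy on the test set
mdlFinal = fitcknn(trainData', trainLabels', 'NumNeighbors', bestK);
testPred = predict(mdlFinal, testData');
testAcc  = mean(testPred == testLabels');              % report this number, then stop tuning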

answered 2014-07-10T17:58:46.900