python - Why are my results poor when I use an SVM model in scikit-learn to recognize handwritten digits from MNIST?

Question

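I am using the SVM model in scikit-learn to predict handwritten digits from MNIST.

However, the results I get are confusing. When I use the trained model to predict the training set it was fitted on, the accuracy is 100%.

On the test data, I only get about 11% accuracy.

I cannot find any cause other than overfitting. Can overfitting really affect the results this much?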

# coding:utf-8
from numpy import *
from sklearn import svm
from sklearn.externals import joblib
def loadData(fileName):
    fr = open(fileName)
    numFeat = len(fr.readline().split(',')) - 1       
    featMatTrain = []                                        
    labelVecTrain = []                                      
    featMatTest = []                                       
    labelVecTest = []                                     
    i = 0
    for line in fr.readlines():
        i = i + 1
        if i != 1 and i <=30000:
             curLine = line.strip().split(',')        
             curLine = map(float,curLine)              
             labelVecTrain.append(curLine[0])           
             featMatTrain.append(curLine[1:numFeat])       
        if i >= 30000:
             curLine = line.strip().split(',')      
             curLine = map(float,curLine)              
             labelVecTest.append(curLine[0])              
             featMatTest.append(curLine[1:numFeat])      
    print '*************************** the training data we got: *****************************'
    print 'featMat:''type of element:',type(featMatTrain) ,'shape of featMat:', shape(featMatTrain)
    print 'labelVec:''type of element:',type(labelVecTrain),'shape of labelVec:',shape(labelVecTrain)
    print 'featMat:''type of element:',type(featMatTest) ,'shape of featMat:', shape(featMatTest)
    print 'labelVec:''type of element:',type(labelVecTest),'shape of labelVec:',shape(labelVecTest)
    return array(featMatTrain),array(labelVecTrain),array(featMatTest),array(labelVecTest)

featMatTrain,labelVecTrain,featMatTest,labelVecTest= loadData('C:/Users/sun/Desktop/train.csv')    
clf = svm.SVC()                                                  
clf.fit(featMatTrain,labelVecTrain)                                           
joblib.dump(clf,'svmModel.pkl')                                     
print '***************** we finish training **********************'
labelVecPredict1 = clf.predict(featMatTrain)
labelVecPredict2 = clf.predict(featMatTest)
print '***************** we finish predicting **********************'
count1 = 0.0
for i in range(len(featMatTrain)):
    if labelVecPredict1[i] == labelVecTrain[i]:
        count1 = count1 + 1
print '************* the result of predicting training set ***************'
print 'the number of figures that predict right: ',count1
print 'the accuary is :',count1/len(featMatTrain)
count2 = 0.0
for i in range(len(featMatTest)):
    if labelVecPredict2[i] == labelVecTest[i]:
        count2 = count2 + 1
print '************ the result to predicting testing set ************'
print 'the number of figures that predict right:',count2
print 'the  accuary is:',count2/len(featMatTest)

score 0 · Accepted Answer

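There are several reasons a model can overfit.

You are using too powerful a model on a small dataset. You could try a linear model instead.
Your training set is too small to train on, so you could move some data from the test set into the training set.

How did you split the MNIST dataset? You may have made an unbalanced split.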
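One way to check the split this answer asks about is to compare per-class label shares in the two partitions and, if they differ, to split with stratification. The following is a minimal sketch, not the asker's code: it assumes scikit-learn's bundled `load_digits` dataset (8x8 digit images) as a small stand-in for the MNIST CSV, and the 50/50 ratio and `random_state` are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): small 8x8 digit images with labels 0-9.
X, y = load_digits(return_X_y=True)

# stratify=y keeps each digit's share roughly equal in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

# Per-class proportions; a large gap between the two rows would
# indicate an unbalanced split.
train_share = np.bincount(y_train, minlength=10) / len(y_train)
test_share = np.bincount(y_test, minlength=10) / len(y_test)
print("train:", np.round(train_share, 3))
print("test: ", np.round(test_share, 3))
```

If the two rows differ noticeably for a split made by hand (for example, by taking the first N rows of a file), shuffling or stratifying before splitting is worth trying.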

score 0 · Accepted Answer

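Overfitting can absolutely affect the results that much. SVC is a "strong learner", which means that with enough features it can overfit any dataset (other strong learners include decision trees and nearest-neighbor models).

To fix this, use a simpler model or use model averaging. Simpler models include LinearSVC; model averaging includes BaggingClassifier and RandomForestClassifier.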
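The alternatives named in this answer can be compared side by side by reporting training and test accuracy for each. This is only a sketch under stated assumptions: it uses scikit-learn's bundled `load_digits` dataset instead of the asker's MNIST CSV, it rescales pixels to [0, 1] (a choice made here, not something the answer mentions), and the hyperparameters are illustrative rather than tuned.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Stand-in data (assumption): 8x8 digit images with pixel values 0-16.
X, y = load_digits(return_X_y=True)
X = X / 16.0  # rescale to [0, 1]; an assumption of this sketch

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# The simpler model and the two averaging models named in the answer.
models = {
    "LinearSVC": LinearSVC(max_iter=10000),
    "BaggingClassifier": BaggingClassifier(n_estimators=50, random_state=0),
    "RandomForestClassifier": RandomForestClassifier(
        n_estimators=100, random_state=0
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    scores[name] = (train_acc, test_acc)
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f}")
```

A large gap between the train and test figures for a model is the overfitting pattern described in the question; a small gap suggests the model generalizes.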
