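I have built a small program that uses scikit-learn to create a classifier for a given dataset. Now I want to try it on an example to see the classifier at work. For example, the clf should detect "cats".

This is how I proceed:

I have 50 pictures of cats and 50 pictures of "no cats".

  1. Get descriptors for the data_set using the SIFT feature detector
  2. Split the data into a training set and a test set (25 cat pictures + 25 non-cat pictures = training_set, same for test_set)
  3. Compute cluster centers with k-means from the training_set
  4. Create histogram data for the training_set and the test_set using the cluster centers
  5. Try this code from scikit-learn: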

    from sklearn.metrics import accuracy_score, classification_report
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # X_train, y_train, X_test, y_test are the histograms and labels from steps 1-4.
    tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                         'C': [1, 10, 100, 1000]},
                        {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

    scores = ['precision', 'recall']

    for score in scores:
        print("# Tuning hyper-parameters for %s" % score)
        print()

        clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=5, scoring=score)
        clf.fit(X_train, y_train)

        print("Best parameters set found on development set:")
        print()
        print(clf.best_estimator_)
        print()
        print("Grid scores on development set:")
        print()
        # cv_results_ replaces the grid_scores_ attribute of older releases.
        results = clf.cv_results_
        for mean_score, std_score, params in zip(results['mean_test_score'],
                                                 results['std_test_score'],
                                                 results['params']):
            print("%0.3f (+/-%0.03f) for %r"
                  % (mean_score, std_score / 2, params))
        print()
        print("Detailed classification report:")
        print()
        print("The model is trained on the full development set.")
        print("The scores are computed on the full evaluation set.")
        print()
        y_true, y_pred = y_test, clf.predict(X_test)
        print(y_true)
        print(y_pred)
        print(classification_report(y_true, y_pred))
        print()
        print(clf.score(X_train, y_train))
        print("score")
        print(clf.best_params_)
        print("best_params")
        pred = clf.predict(X_test)
        print(accuracy_score(y_test, pred))
        print("accuracy_score")
    

I get this result:

# Tuning hyper-parameters for recall
()
/usr/local/lib/python2.7/dist-packages/sklearn/metrics/metrics.py:1760: UserWarning: The sum of true positives and false positives are equal to zero for some labels. Precision is ill defined for those labels [ 0.]. The precision and recall are equal to zero for some labels. fbeta_score is ill defined for those labels [ 0.]. 
  average=average)
/usr/local/lib/python2.7/dist-packages/sklearn/metrics/metrics.py:1760: UserWarning: The sum of true positives and false positives are equal to zero for some labels. Precision is ill defined for those labels [ 1.]. The precision and recall are equal to zero for some labels. fbeta_score is ill defined for those labels [ 1.]. 
  average=average)
Best parameters set found on development set:
()
SVC(C=0.001, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.001, kernel=rbf, max_iter=-1, probability=False,
  random_state=None, shrinking=True, tol=0.001, verbose=False)
()
Grid scores on development set:
()
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.001, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.001, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.01, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.01, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.10000000000000001, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.10000000000000001, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 1.0, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 1.0, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 10.0, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 10.0, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 100.0, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 100.0, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 1000.0, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 1000.0, 'gamma': 0.0001}
()
Detailed classification report:
()
The model is trained on the full development set.
The scores are computed on the full evaluation set.
()
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  0.  1.  1.  1.  1.  1.]
             precision    recall  f1-score   support

        0.0       1.00      0.04      0.08        25
        1.0       0.51      1.00      0.68        25

avg / total       0.76      0.52      0.38        50

()
0.52
score
{'kernel': 'rbf', 'C': 0.001, 'gamma': 0.001}
best_params
0.52
accuracy_score

It seems that the clf says everything is a cat... but why?

Is the data_set too small to get a good result?

EDIT: I am using VLFeat to detect the SIFT descriptors

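The functions:

import numpy
from sklearn.cluster import KMeans

import vlfeat_module  # the asker's own VLFeat wrapper module


def create_descriptor_data(data, ID):
    # `data` is a text file listing one image path per line
    descriptor_list = []
    datas = numpy.genfromtxt(data, dtype='str')
    for p in datas:
        # create descriptors and save them in the file <ID>.key
        locs, desc = vlfeat_module.vlf_create_descriptors(p, str(ID) + '.key', ID)
        if len(desc) > 500:
            # subsample with an integer stride (a float stride is not a valid
            # slice step) so that roughly 400 - 800 descriptors remain
            step = max(1, len(desc) // 400)
            desc = desc[::step]
        descriptor_list.append(desc)
        ID += 1  # ID for filename
    return descriptor_list


# create k-means centers from the stacked descriptors (data)
def create_center_data(data):
    #data = numpy.vstack(data)
    # NOTE: this counts the distinct scalar values in the descriptor matrix,
    # not the distinct descriptors, and uses that count as the vocabulary size.
    n_clusters = len(numpy.unique(data))
    kmeans = KMeans(init='k-means++', n_clusters=n_clusters, n_init=1)
    kmeans.fit(data)
    return kmeans, n_clusters


def create_histogram_data(kmeans, descs, n_clusters):
    histogram_list = []
    # one entry of `descs` holds the descriptors of one image
    for desc in descs:
        length = len(desc)
        # assign each descriptor to its nearest center, then count per center
        histogram = kmeans.predict(desc)
        histogram = numpy.bincount(histogram, minlength=n_clusters)  # minlength = k in k-means
        # normalise so that images with different descriptor counts are comparable
        histogram = numpy.divide(histogram, length, dtype='float')
        histogram_list.append(histogram)
    histogram = numpy.vstack(histogram_list)
    return histogram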
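
As a toy illustration of what step 4 computes (this is not the asker's code), the following sketch builds a normalised bag-of-visual-words histogram in plain Python. The two cluster centers, the 2-D "descriptors", and the function names are all hypothetical and only serve to show the nearest-center counting.

```python
def nearest_center(point, centers):
    """Return the index of the center closest to `point` (squared Euclidean)."""
    distances = [sum((p - c) ** 2 for p, c in zip(point, center))
                 for center in centers]
    return distances.index(min(distances))


def bow_histogram(descriptors, centers):
    """Count how many descriptors fall on each center, then normalise."""
    counts = [0] * len(centers)
    for descriptor in descriptors:
        counts[nearest_center(descriptor, centers)] += 1
    total = float(len(descriptors))
    return [count / total for count in counts]


# Hypothetical cluster centers and descriptors (real SIFT descriptors are 128-D).
toy_centers = [(0.0, 0.0), (10.0, 10.0)]
toy_descriptors = [(1.0, 0.0), (0.0, 2.0), (2.0, 1.0), (9.0, 9.0)]

toy_histogram = bow_histogram(toy_descriptors, toy_centers)
print(toy_histogram)  # three descriptors land on center 0, one on center 1
```

Each image ends up as one fixed-length vector whose entries sum to 1, which is what the SVM is trained on.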

And the calls:

import numpy
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

import lib.dataset_module  # the asker's module containing the functions above

X_desc_pos = lib.dataset_module.create_descriptor_data("./static/picture_set/dataset_pos.txt",0) # create desc from dataset_pos, 25 pics
X_desc_neg = lib.dataset_module.create_descriptor_data("./static/picture_set/dataset_neg.txt",51) # create desc from dataset_neg, 25 pics

X_train_pos, X_test_pos = train_test_split(X_desc_pos, test_size=0.5)
X_train_neg, X_test_neg = train_test_split(X_desc_neg, test_size=0.5)

# the visual vocabulary is learned from the training descriptors only
x1 = numpy.vstack(X_train_pos)
x2 = numpy.vstack(X_train_neg)
kmeans, n_clusters = lib.dataset_module.create_center_data(numpy.vstack((x1,x2)))

X_train_pos = lib.dataset_module.create_histogram_data(kmeans, X_train_pos, n_clusters)
X_train_neg = lib.dataset_module.create_histogram_data(kmeans, X_train_neg, n_clusters)

X_train = numpy.vstack([X_train_pos, X_train_neg])
y_train = numpy.hstack([numpy.ones(len(X_train_pos)), numpy.zeros(len(X_train_neg))])

X_test_pos = lib.dataset_module.create_histogram_data(kmeans, X_test_pos, n_clusters)
X_test_neg = lib.dataset_module.create_histogram_data(kmeans, X_test_neg, n_clusters)

X_test = numpy.vstack([X_test_pos, X_test_neg])
y_test = numpy.hstack([numpy.ones(len(X_test_pos)), numpy.zeros(len(X_test_neg))])

tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=5, scoring=score)
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_estimator_)
    print()
    print("Grid scores on development set:")
    print()
    # cv_results_ replaces the grid_scores_ attribute of older releases.
    results = clf.cv_results_
    for mean_score, std_score, params in zip(results['mean_test_score'],
                                             results['std_test_score'],
                                             results['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean_score, std_score / 2, params))
    print()
    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print(y_true)
    print(y_pred)
    print(classification_report(y_true, y_pred))
    print()
    print(clf.score(X_train, y_train))
    print("score")
    print(clf.best_params_)
    print("best_params")
    pred = clf.predict(X_test)
    print(accuracy_score(y_test, pred))
    print("accuracy_score")

EDIT: I made some changes by widening the ranges and scored by "accuracy" this time:

# Tuning hyper-parameters for accuracy
()
Best parameters set found on development set:
()
SVC(C=1000.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=1.0, kernel=rbf, max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
()
Grid scores on development set:
()
...
()
Detailed classification report:
()
The model is trained on the full development set.
The scores are computed on the full evaluation set.
()
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  0.  1.  0.  1.  1.  1.
  1.  1.  1.  0.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.
  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]
             precision    recall  f1-score   support

        0.0       0.88      0.92      0.90        25
        1.0       0.92      0.88      0.90        25

avg / total       0.90      0.90      0.90        50

()
1.0
score
{'kernel': 'rbf', 'C': 1000.0, 'gamma': 1.0}
best_params
0.9
accuracy_score

But when I test it on a single picture with

rslt = clf.predict(test_histogram)

it still tells a sofa: "You are a cat" :D

2 Answers

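"It seems that the clf says everything is a cat... but why?"

It is hard to tell from the pasted output, but this appears to be the second iteration of the loop over scores = ['precision', 'recall'], so you are optimizing for recall. That is consistent with the classification report, which shows a recall of 1.00 (perfect) for the positive class.

When is recall perfect? When there are no false negatives, i.e. when no cat goes undetected. So the trivial way to get perfect recall is to predict "cat" for every input picture, whether or not it is a cat, and GridSearchCV found a classifier that does just that.

Something similar can happen when you optimize for precision: a classifier that never predicts "cat" makes no false positives. Strictly speaking, precision is then ill-defined, which is what the UserWarning in your output reports.

To avoid this, optimize for accuracy instead of precision or recall, or for Fᵦ if you have class imbalance.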
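
The effect can be checked by hand. The sketch below is plain Python, not scikit-learn; the helper names are made up for illustration. It scores two degenerate classifiers on a balanced 25/25 test set like the one in the question: one that always says "cat" and one that never does.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def summarize(y_true, y_pred):
    """Recall, precision and accuracy for the positive ("cat") class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    predicted_positive = tp + fp
    return {
        "recall": tp / float(tp + fn),
        # precision is 0/0 when nothing is predicted positive
        "precision": (tp / float(predicted_positive)
                      if predicted_positive else float("nan")),
        "accuracy": (tp + tn) / float(len(y_true)),
    }


y_true = [1] * 25 + [0] * 25      # 25 cats, 25 non-cats
always_cat = summarize(y_true, [1] * 50)
never_cat = summarize(y_true, [0] * 50)

print("always cat:", always_cat)
print("never cat: ", never_cat)
```

The "always cat" classifier reaches recall 1.0 while its accuracy stays at 0.5, close to the 1.00 recall and 0.52 accuracy in the question's report. Accuracy does not reward either degenerate strategy on balanced data.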

answered 2013-08-13T14:28:34.803

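There are several possible causes for this behaviour:

  • An error in creating the training/testing data [implementation error]
  • A training set of 40 elements (50 vectors with 5-fold cross-validation leaves 40 for training) may be too small for good generalization [underfitting]
  • The ranges checked for the C and gamma parameters may be too narrow. These parameters are highly data-dependent, and your representation may need values of C and gamma completely different from the ones currently used [under/overfitting]

My personal guess (it is hard to reproduce the problem without the data) is the third option: the C and gamma ranges are not suited to finding a good model.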

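EDIT

You should try a much wider range of values, e.g.

  • C between 10^-5 and 10^15
  • gamma between 10^-14 and 10^2

    # powers of ten: C from 1e-5 to 1e15, gamma from 1e-14 to 1e2
    C = [10.0 ** e for e in range(-5, 16)]
    gamma = [10.0 ** e for e in range(-14, 3)]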
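
Before running such a wide search it may help to know how large it is. The sketch below is an illustration under assumptions: it reuses the two-part grid layout and cv=5 from the question, and the names `wide_parameters` and `count_candidates` are made up here. It builds the grid in plain Python and counts the candidates.

```python
# Exponent ranges as suggested above: C in 1e-5..1e15, gamma in 1e-14..1e2.
C_range = [10.0 ** e for e in range(-5, 16)]
gamma_range = [10.0 ** e for e in range(-14, 3)]

# Same two-part layout as tuned_parameters in the question.
wide_parameters = [
    {'kernel': ['rbf'], 'gamma': gamma_range, 'C': C_range},
    {'kernel': ['linear'], 'C': C_range},
]


def count_candidates(param_grid):
    """Number of parameter combinations in a list of grid dictionaries."""
    total = 0
    for grid in param_grid:
        size = 1
        for values in grid.values():
            size *= len(values)
        total += size
    return total


n_candidates = count_candidates(wide_parameters)
n_folds = 5  # cv=5, as in the question
print(n_candidates, "candidates,", n_candidates * n_folds, "cross-validation fits")
```

A list built this way can be passed to GridSearchCV in place of tuned_parameters. If the search turns out to be too slow, stepping through every second power of ten first and refining around the best region is a common compromise.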
    

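EDIT 2

Once the parameter ranges are corrected, you should carry out an actual "case study". Collect more images, analyze your data representation (are histograms really enough for this task?), process your data (is it normalized? maybe try some decorrelation?), and consider using a simpler kernel. RBF can be very deceptive: it can achieve great scores during training and then fail completely during testing. This is a consequence of its capacity to overfit (an RBF SVM can reach a 100% training score on any consistent dataset), so finding the balance between the model's capacity and its ability to generalize is a hard problem. This is where the real "machine learning journey" begins. Have fun!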
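
To see why gamma matters so much for normalized histograms, here is a small sketch of the RBF kernel, K(x, y) = exp(-gamma * ||x - y||^2), evaluated in plain Python. The two histograms are hypothetical values chosen for illustration, not data from the question.

```python
import math


def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two hypothetical normalised histograms that differ only slightly.
hist_a = [0.50, 0.30, 0.20]
hist_b = [0.45, 0.35, 0.20]

for gamma in (1e-4, 1.0, 1e4, 1e6):
    print("gamma=%g  K(a, b)=%.6f" % (gamma, rbf_kernel(hist_a, hist_b, gamma)))
```

With a tiny gamma the kernel is almost 1 for every pair, so all images look alike to the SVM. That is one plausible reason a grid limited to gamma = 1e-3 and 1e-4 ends up predicting a single class. With a huge gamma each training point is similar only to itself, which is the memorizing regime described above.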

answered 2013-08-13T14:07:40.680