python - KNN prediction reaches 100% accuracy on y_test

Question

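I'm trying to implement K-nearest neighbors on the Iris dataset, but after making predictions, yhat is 100% correct with no errors. Something must be wrong, and I can't figure out what it is...

I created a column named class_id, in which I mapped the classes as follows:

setosa = 1.0
versicolor = 2.0
virginica = 3.0

The column is of float type.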
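The mapping described above can be sketched as follows. The question does not show how class_id was built, so the source column name `class` and the label strings `Iris-setosa` etc. are assumptions made only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the asker's DataFrame: the "class" column name
# and the label strings are assumptions, not taken from the question.
df = pd.DataFrame(
    {"class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]}
)

# Map each species name to the float id used in the question.
class_to_id = {"Iris-setosa": 1.0, "Iris-versicolor": 2.0, "Iris-virginica": 3.0}
df["class_id"] = df["class"].map(class_to_id)

print(df["class_id"].tolist())
print(df["class_id"].dtype)
```

Any label missing from the dictionary would become NaN after `map`, so checking `df["class_id"].isna().any()` is a cheap sanity test.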

Getting X and Y


    x = df[['sepal length', 'sepal width', 'petal length', 'petal width']].values

type(x) shows a NumPy array.


    y = df['class_id'].values

type(y) shows a NumPy array.

Normalizing the data


    x = preprocessing.StandardScaler().fit(x).transform(x.astype(float))

Creating the train and test sets


    x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.2, random_state = 42)
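One thing worth ruling out in the steps above: the scaler is fitted on all of x before the split, so the test rows contribute to the mean and standard deviation. On Iris this is unlikely to explain the result, but a minimal sketch of the leak-free ordering looks like the following. It assumes scikit-learn's bundled `load_iris` as a stand-in for the asker's DataFrame (its labels are 0, 1, 2 rather than 1.0, 2.0, 3.0), and the `_scaled` variable names are introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: load_iris stands in for the asker's df-based x and y.
X, y = load_iris(return_X_y=True)

# Split first, so the test rows are never seen while fitting the scaler.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both sets.
scaler = StandardScaler().fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

print(x_train_scaled.shape, x_test_scaled.shape)
```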

Checking for the best K value:


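    from sklearn import metrics
    from sklearn.neighbors import KNeighborsClassifier

    Ks = 12
    # Try K = 1 .. 11 and report test-set accuracy for each value
    for k in range(1, Ks):
        neigh = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
        yhat = neigh.predict(x_test)
        score = metrics.accuracy_score(y_test, yhat)
        print('K: ', k, ' score: ', score, '\n')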

Results:

K: 1 score: 0.9666666666666667

K: 2 score: 1.0

K: 3 score: 1.0

K: 4 score: 1.0

K: 5 score: 1.0

K: 6 score: 1.0

K: 7 score: 1.0

K: 8 score: 1.0

K: 9 score: 1.0

K: 10 score: 1.0

K: 11 score: 1.0
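A single 30-sample test set gives a coarse estimate: each sample is worth about 3.3 percentage points, so one lucky split can read as 100%. A rough sketch of a steadier check is to score each K with 5-fold cross-validation. This assumes `load_iris` as a stand-in for the asker's data, and the `cv_scores` dictionary is a name introduced here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: load_iris stands in for the asker's df-based x and y.
X, y = load_iris(return_X_y=True)

cv_scores = {}
for k in range(1, 12):
    # The pipeline refits the scaler inside every fold, so no fold's
    # held-out rows influence the scaling.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X, y, cv=5).mean()
    print("K:", k, "mean CV accuracy:", round(cv_scores[k], 3))
```

If the cross-validated accuracy is also high for most K, the perfect test score is more plausibly a property of the dataset than a bug.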

Printing y_test and yhat with K = 5


    print(yhat)
    print(y_test)

Results:

yhat: [2. 1. 3. 2. 2. 1. 2. 3. 2. 2. 3. 1. 1. 1. 1. 2. 3. 2. 2. 3. 1. 3. 1. 3. 3. 3. 3. 3. 1. 1.]

y_test: [2. 1. 3. 2. 2. 1. 2. 3. 2. 2. 3. 1. 1. 1. 1. 2. 3. 2. 2. 3. 1. 3. 1. 3. 3. 3. 3. 3. 1. 1.]

It shouldn't all be 100% correct, so something must be wrong.

score 0 · Accepted Answer

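I found the answer thanks to an explanation from the user skillsmuggler:

You are using the Iris dataset. It is a well-cleaned and well-modeled dataset. The features are strongly correlated with the outcome, which is why the kNN model fits the data so well. To test this, you can reduce the size of the training set, which should cause the accuracy to drop.

The prediction model is correct.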
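The suggested experiment can be sketched roughly as below: hold K fixed at 5 and shrink the training set by raising `test_size`. This is an illustration under assumptions, not the answerer's code: `load_iris` stands in for the asker's data, the particular `test_size` values are arbitrary, and `stratify=y` is added so that even the smallest training set contains all three classes.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: load_iris stands in for the asker's df-based x and y.
X, y = load_iris(return_X_y=True)

results = {}
for test_size in (0.2, 0.5, 0.8, 0.9):
    # A larger test_size leaves fewer rows for training.
    x_train, x_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42, stratify=y
    )
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    model.fit(x_train, y_train)
    results[test_size] = accuracy_score(y_test, model.predict(x_test))
    print("train size:", len(x_train), " accuracy:", round(results[test_size], 3))
```

The exact numbers depend on the split, so it is worth repeating the loop with a few different `random_state` values before drawing conclusions.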

score 0 · Accepted Answer

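Try building a confusion matrix. Run every example in your test data through the model and check the specificity, sensitivity, accuracy and precision metrics.

Where: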

TN = True Negative
FN = False Negative
FP = False Positive
TP = True Positive

Here you can read about the difference between specificity and sensitivity: https://dzone.com/articles/ml-metrics-sensitivity-vs-specificity-difference

Here is an example of how to get a confusion matrix in Python with sklearn.
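A minimal sketch of that check, using the y_test and yhat values printed in the question (they are identical, so the matrix comes out purely diagonal). `confusion_matrix` is scikit-learn's own function. The per-class TP/FN/FP/TN arithmetic is a one-vs-rest reading of the matrix written out by hand here, not a library call.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Values copied from the output printed in the question (K = 5).
y_test = np.array(
    [2, 1, 3, 2, 2, 1, 2, 3, 2, 2, 3, 1, 1, 1, 1,
     2, 3, 2, 2, 3, 1, 3, 1, 3, 3, 3, 3, 3, 1, 1],
    dtype=float,
)
yhat = y_test.copy()  # the question's yhat matches y_test element for element

labels = [1.0, 2.0, 3.0]
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, yhat, labels=labels)
print(cm)

total = cm.sum()
for i, label in enumerate(labels):
    tp = cm[i, i]                # predicted this class, and it was this class
    fn = cm[i, :].sum() - tp     # was this class, predicted as another
    fp = cm[:, i].sum() - tp     # predicted this class, was another
    tn = total - tp - fn - fp    # everything else
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    print("class", label, "sensitivity:", sensitivity, "specificity:", specificity)
```

Any nonzero off-diagonal entry would show which pair of classes the model confuses.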

Also try plotting a ROC curve (optional): https://en.wikipedia.org/wiki/Receiver_operating_characteristic
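ROC analysis is defined for binary problems, so with three Iris classes one common workaround is a one-vs-rest average of the area under the curve. The sketch below takes that route with `roc_auc_score(..., multi_class="ovr")`. It assumes `load_iris` as a stand-in for the asker's data and K = 5, and it reports a single summary number rather than drawing the curve.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: load_iris stands in for the asker's df-based x and y.
X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(x_train, y_train)

# predict_proba returns one column per class: the fraction of the
# 5 nearest neighbours that belong to that class.
proba = model.predict_proba(x_test)

# One-vs-rest: each class is scored against the other two, then averaged.
auc = roc_auc_score(y_test, proba, multi_class="ovr")
print("one-vs-rest ROC AUC:", auc)
```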
