I have very simple code that uses a k-NN model to classify irises:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the iris data into a feature frame X and a single-column target frame y
data = load_iris()
X = pd.DataFrame(data=data.data, columns=data.feature_names)
y = pd.DataFrame(data=data.target, columns=['class'])
df = pd.concat([X, y], axis=1)
df.head()

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit a 5-nearest-neighbors classifier; ravel() flattens y to a 1-D array
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train.values.ravel())
print('Score =', knn.score(X_test, y_test))

# iris_1 and iris_2 are plausible samples; iris_3 and iris_4 are far out of range
iris_1 = [7.1, 3.3, 6.1, 1.5]
iris_2 = [5.3, 3.5, 1.6, 0.3]
iris_3 = [15.1, 13.7, 12.9, 11.3]
iris_4 = [0.1, 0.2, 0.4, 0.5]
irises = [iris_1, iris_2, iris_3, iris_4]
print(knn.predict(irises))
The model performs very well, but I have found one issue. As you can see, iris_3 and iris_4 are completely out of range, yet the model still returns class 2 for iris_3 and class 0 for iris_4. They should be marked as "UNKNOWN" or something similar. I tried k-NN regression, but it returns something like 2.05 for iris_3, so the result cannot be rounded to a different class the way a linear regression output could.

My question is: is there any way to protect the model from doing this? Or should I use this model only when I am sure that the incoming data is 100% valid and contains no strange values?
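
To make the kind of protection I am asking about concrete, here is a minimal sketch of one possible guard, not a definitive solution: reject any sample whose mean distance to its 5 nearest training points is larger than what is typical inside the training set. The names UNKNOWN, fit_with_threshold and predict_or_unknown, the 0.99 quantile and random_state=0 are all my own assumptions for illustration; only KNeighborsClassifier.kneighbors and predict come from scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

UNKNOWN = -1  # hypothetical label for rejected samples (an assumption)


def fit_with_threshold(X_train, y_train, n_neighbors=5, quantile=0.99):
    """Fit k-NN and derive a distance cutoff from the training set itself."""
    knn = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_train, y_train)
    # Called without X, kneighbors() returns each training point's neighbors
    # and does not count the point as its own neighbor.
    dist, _ = knn.kneighbors()
    # Cutoff: an assumed high quantile of the mean neighbor distance.
    threshold = np.quantile(dist.mean(axis=1), quantile)
    return knn, threshold


def predict_or_unknown(knn, threshold, X):
    """Predict classes, replacing far-away samples with UNKNOWN."""
    X = np.asarray(X, dtype=float)
    dist, _ = knn.kneighbors(X)
    labels = knn.predict(X)
    return np.where(dist.mean(axis=1) > threshold, UNKNOWN, labels)


X, y = load_iris(return_X_y=True)
# random_state is fixed only to make this sketch reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
knn, threshold = fit_with_threshold(X_train, y_train)

irises = [
    [7.1, 3.3, 6.1, 1.5],
    [5.3, 3.5, 1.6, 0.3],
    [15.1, 13.7, 12.9, 11.3],  # far outside the training range
    [0.1, 0.2, 0.4, 0.5],      # far outside the training range
]
result = predict_or_unknown(knn, threshold, irises)
print(result)
```

With this cutoff, iris_3 and iris_4 should come back as UNKNOWN because they are several units away from every training point. Borderline samples depend on the chosen quantile, so the threshold would need tuning on real data.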