python - All predicted probabilities for unseen data are less than 0.5

Question

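I have 15 features with a binary response variable, and I am interested in predicting probabilities rather than the 0 or 1 class labels. When I trained and tested an RF model with 500 trees, CV, balanced class weights, and balanced samples in the data frame, I obtained good accuracy and a good Brier score. As you can see in the image, the predicted probabilities for class 1 on the test data range between 0 and 1.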

Here is the histogram of predicted probabilities on the test data:

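Most of the values fall between 0 and 0.2 or between 0.9 and 1, which is quite accurate. However, when I try to predict probabilities for unseen data, that is, for data points whose 0 or 1 label is unknown, the predicted probabilities for class 1 only range between 0 and 0.5. Why is that? Shouldn't the values range from 0.5 to 1?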

Here is the histogram of predicted probabilities on the unseen data:

I am using sklearn's RandomForestClassifier in Python. The code is below:

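import pandas as pd
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Read the CSV
df = pd.read_csv('path/df_all.csv')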

#Change the type of the variable as needed
df=df.astype({'probabilities': 'int32', 'CPZ_CI_new.tif' : 'category'})

# Response variable: the binary class label (0 or 1), stored in the 'probabilities' column
y = df['probabilities']

# Separate majority and minority classes
df_majority = df[y == 0]
df_minority = df[y == 1]

# Upsample minority class
df_minority_upsampled = resample(df_minority,
                                 replace=True,  # sample with replacement
                                 n_samples=100387,  # to match majority class
                                 random_state=42)  # reproducible results

# Combine majority class with upsampled minority class
df1 = pd.concat([df_majority, df_minority_upsampled])

y = df1['probabilities']
X = df1.iloc[:,1:138]

# Convert the integer labels to a categorical type
y_01=y.astype('category')

#Split training and testing
X_train, X_valid, y_train, y_valid = train_test_split(X, y_01, test_size = 0.30, random_state = 42,stratify=y)

#Model

model=RandomForestClassifier(n_estimators = 500,
                           max_features= 'sqrt',
                           n_jobs = -1,
                           oob_score = True,
                           bootstrap = True,
                           random_state=0,class_weight='balanced',)
# I had 137 variables; to select the optimal subset, I used RFECV
rfecv = RFECV(model, step=1, min_features_to_select=1, cv=10, scoring='neg_brier_score')
rfecv.fit(X_train, y_train)

#Retrained the model with only 15 variables selected
rf=RandomForestClassifier(n_estimators = 500,
                           max_features= 'sqrt',
                           n_jobs = -1,
                           oob_score = True,
                           bootstrap = True,
                           random_state=0,class_weight='balanced',)

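# X1_train / X1_valid: the same data frames restricted to the 15 selected variables
# (assumed here to be the columns kept by RFECV)
X1_train = X_train.loc[:, rfecv.support_]
X1_valid = X_valid.loc[:, rfecv.support_]
rf.fit(X1_train, y_train)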

# ROC AUC on the test data (computed from hard class predictions, as in the original run)
print('roc_auc_score_testing:', metrics.roc_auc_score(y_valid, rf.predict(X1_valid)))

# Predicted probabilities of class 1 on the training and test data
predt = rf.predict_proba(X1_train)[:, 1]
predv = rf.predict_proba(X1_valid)[:, 1]
print('brier_score_training:', metrics.brier_score_loss(y_train, predt))
print('brier_score_testing:', metrics.brier_score_loss(y_valid, predv))

#Output is,
roc_auc_score_testing: 0.9832652130944419
brier_score_training: 0.002380976369884945
brier_score_testing: 0.01669848089917487

# Later, I built a data frame (sample_img) from images of those 15 variables and used the same model to predict probabilities.

IMG_pred=rf.predict_proba(sample_img)
IMG_pred=IMG_pred[:,1]

score 1 · Accepted Answer

The results shown for your test data are not valid; you followed a flawed procedure, and it has two serious consequences that invalidate them.

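The mistake is that you upsample the minority class before splitting into training and test sets, which should not be done; you should first split into training and test sets, and then upsample only the training data, not the test data.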

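The first reason this procedure is invalid is that some of the duplicates created by upsampling end up in both the training and the test split; as a result, the algorithm is tested on samples it has already seen during training, which violates the fundamental requirement of a test set. For more details, see my own answer in Process for oversampling data for imbalanced binary classification; quoting from there: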

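I once witnessed a case where the modeller was struggling to understand why he was getting 100% test accuracy, much higher than his training accuracy; it turned out his initial dataset was full of duplicates (no class imbalance here, but the idea is similar), and several of these duplicates naturally ended up in his test set after the split, without of course being new or unseen data...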
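
The leakage described above can be made visible with a small experiment. The sketch below uses synthetic, hypothetical data (the `row_id`, `feature`, and `label` columns are illustrative and not from the question) and repeats the flawed order, upsampling first and splitting second, then counts how many minority rows in the test set are copies of rows that are also in the training set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic, hypothetical data: 1000 majority rows and 50 minority rows.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "row_id": np.arange(1050),        # identifies each original row
    "feature": rng.normal(size=1050),
    "label": [0] * 1000 + [1] * 50,
})

# The flawed order: upsample the minority class first ...
minority_up = resample(df[df["label"] == 1], replace=True,
                       n_samples=1000, random_state=42)
balanced = pd.concat([df[df["label"] == 0], minority_up])

# ... and only then split into training and test sets.
train, test = train_test_split(balanced, test_size=0.30, random_state=42,
                               stratify=balanced["label"])

# Minority test rows whose original row also appears in the training set.
test_minority = test[test["label"] == 1]
train_minority_ids = set(train.loc[train["label"] == 1, "row_id"])
leaked_fraction = test_minority["row_id"].isin(train_minority_ids).mean()
print(f"Minority test rows already seen in training: {leaked_fraction:.0%}")
```

With only 50 distinct minority rows copied 1000 times, nearly every minority row in the test set has a twin in the training set, so the test score for that class mostly measures memorization.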

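The second reason is that this procedure reports biased performance measures on a test set that no longer represents reality: remember, we want our test set to be representative of the real unseen data, which will of course be imbalanced. Artificially balancing the test set and claiming X% accuracy, when a large part of that accuracy is due to the artificially upsampled minority class, makes no sense and gives a misleading impression. For details, see my own answer in Balance classes in cross validation (the rationale is identical for a train-test split, as here).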

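This second reason is why your procedure would still be wrong even if you had not made the first mistake and had instead upsampled the training and test sets separately after splitting.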
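
To see why a balanced test set misrepresents performance even without any leakage, it can help to work through the arithmetic. The sketch below swaps in precision as the illustrative metric (the answer speaks of accuracy) and uses a hypothetical classifier with a 90% true-positive rate and a 5% false-positive rate; these numbers and the helper `precision_at_prevalence` are assumptions for illustration only.

```python
def precision_at_prevalence(tpr: float, fpr: float, prevalence: float) -> float:
    """Precision implied by fixed TPR and FPR at a given positive-class prevalence."""
    true_pos = tpr * prevalence            # share of all samples that are true positives
    false_pos = fpr * (1.0 - prevalence)   # share of all samples that are false positives
    return true_pos / (true_pos + false_pos)


# Hypothetical classifier: 90% true-positive rate, 5% false-positive rate.
TPR, FPR = 0.90, 0.05

balanced = precision_at_prevalence(TPR, FPR, prevalence=0.50)
realistic = precision_at_prevalence(TPR, FPR, prevalence=0.05)
print(f"precision on a balanced test set:    {balanced:.3f}")
print(f"precision at 5% positive prevalence: {realistic:.3f}")
```

The same classifier scores a precision of about 0.95 on the balanced set but only about 0.49 when positives make up 5% of the data, which is the kind of gap a resampled test set hides.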

In short, you should fix the procedure so that you first split into training and test sets, and then upsample only your training set.
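
A minimal sketch of the corrected order follows, assuming synthetic stand-in data from `make_classification` (15 features, roughly 5% positives) instead of the question's CSV. The column names, the `label` name, and the smaller forest of 100 trees are illustrative assumptions, and feature selection is left out for brevity.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the real data (assumption): 15 features, ~5% positives.
X, y = make_classification(n_samples=2000, n_features=15, n_informative=5,
                           weights=[0.95, 0.05], random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(15)])
y = pd.Series(y, name="label")

# 1. Split first, so the test set keeps the real class imbalance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

# 2. Upsample the minority class in the training set only.
train = pd.concat([X_train, y_train], axis=1)
majority = train[train["label"] == 0]
minority = train[train["label"] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
train_bal = pd.concat([majority, minority_up])

# 3. Fit on the balanced training data.
rf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
rf.fit(train_bal.drop(columns="label"), train_bal["label"])

# 4. Evaluate on the untouched, still imbalanced test set.
proba_test = rf.predict_proba(X_test)[:, 1]
print("roc_auc_test:", roc_auc_score(y_test, proba_test))
print("brier_test:", brier_score_loss(y_test, proba_test))
```

Because the test rows are never resampled, no copy of a test row can appear in training, and the reported metrics reflect the class balance the model will meet on unseen data.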
