python-2.7 - Choosing a threshold for a model with binary class labels in Python

Question

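Use case: choosing the "best threshold" for a logistic model built with statsmodels' Logit, in order to predict a binary class (or a multinomial one, but with integer classes).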

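Is there something built in for choosing a threshold for a (say, logistic) model in Python? For small datasets, I remember optimizing the "threshold" by picking the bucket with the most correctly predicted labels (true "0"s and true "1"s), as can be seen from the graph here - http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test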
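The idea in that graph can be sketched as follows: take the predicted probabilities, split them by true class, and pick the cutoff where the two empirical cumulative distributions are furthest apart. This is only a minimal sketch under stated assumptions: `ks_threshold` is a hypothetical helper (not part of statsmodels or scipy), the data are synthetic, and a score strictly above the threshold is labelled "1".

```python
import numpy as np

def ks_threshold(y_true, scores):
    """Hypothetical helper: return (threshold, gap) where the empirical CDFs
    of the scores of true 0s and true 1s are furthest apart."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    candidates = np.unique(scores)      # every observed score is a candidate cutoff
    neg = np.sort(scores[~y_true])      # scores of the true 0s
    pos = np.sort(scores[y_true])       # scores of the true 1s
    # Fraction of each class with score <= candidate (empirical CDF)
    cdf_neg = np.searchsorted(neg, candidates, side="right") / float(neg.size)
    cdf_pos = np.searchsorted(pos, candidates, side="right") / float(pos.size)
    # cdf_neg is the true negative rate and cdf_pos the false negative rate
    # at that cutoff, so the gap equals TPR - FPR
    gap = cdf_neg - cdf_pos
    best = np.argmax(gap)
    return candidates[best], gap[best]

# Synthetic stand-in for model output (an assumption, not the asker's data)
rng = np.random.RandomState(0)
y_demo = rng.binomial(1, 0.3, size=1000)
scores_demo = np.clip(0.25 + 0.3 * y_demo + rng.normal(0, 0.15, size=1000), 0, 1)
demo_threshold, demo_gap = ks_threshold(y_demo, scores_demo)
```

With a fitted model, `scores` would be the continuous output of `predict()` and `y_true` the observed labels. The gap maximized here is the same quantity as TPR minus FPR on a ROC curve, so this cutoff treats both kinds of error as equally important.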

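I also know intuitively that if I set an alpha value, it should give me a "threshold" that I can use below. Given a reduced model whose variables are all significant at 95% confidence, how should I compute the threshold? Obviously, setting the threshold at > 0.5 -> "1" is too naive; since I am looking at 95% confidence, this threshold should be "smaller", meaning p > 0.2 or something.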

This would mean something like a range of "critical values" that decides whether the label should be "1" or "0".

What I want is something like this:

import numpy as np
import statsmodels.formula.api as smf
from scipy.stats import ks_2samp

# Fit the logistic model on the training data
test_scores = smf.Logit(y_train, x_train, missing='drop').fit()
threshold = 0.2
# test_scores.predict(x_train, transform=False) gives continuous probabilities,
# so to turn them into labels I need to compare them against a threshold
# (use x_test instead if I am testing the model)
y_predicted_train = np.array(test_scores.predict(x_train, transform=False) > threshold, dtype=float)
# 2x2 table of real vs. predicted labels
table = np.histogram2d(y_train, y_predicted_train, bins=2)[0]
# will do the same on the "test" data


# crude way of selecting an optimal threshold
ks_2samp(y_train, y_predicted_train)
# output: (0.39963996399639962, 0.958989)
# must get < 95% here; keep modifying the threshold as above until I fail to reject the null at 95%

# where y_train holds the REAL values and y_predicted_train the predictions on the TRAIN dataset.
# Note that to get y_predicted_train (as binary) I have already thresholded as shown above.

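Questions:

1. How do I choose the threshold in an objective way, i.e. so as to reduce the percentage of misclassified labels? Say I care more about missing a "1" (a false negative) than about mispredicting a "0" as a "1" (a false positive), and I want to reduce the former error. This is what I get from a ROC curve. The ROC curve in statsmodels (roc_curve) assumes that I have already done the labelling for the y_predicted class and that I am just re-validating it on the test set (correct me if my understanding is wrong). I also think that using a confusion matrix will not solve the problem of picking the threshold.

2. This leads me to ask: how should I use the output of these built-in functions (oob, confusion_matrix) to pick the best threshold (first on the train sample, then fine-tuning it on the test and cross-validation samples)?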
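One way to make the asymmetry in question 1 explicit is to put a price on each kind of error and sweep the threshold, keeping the one with the lowest total cost. The sketch below is an illustration only: `pick_threshold_by_cost`, the cost values 5 and 1, and the synthetic data are all assumptions, not anything provided by statsmodels or scikit-learn.

```python
import numpy as np

def pick_threshold_by_cost(y_true, scores, cost_fn=5.0, cost_fp=1.0):
    """Hypothetical helper: sweep thresholds on a fixed grid and return
    (threshold, cost) with the lowest weighted misclassification cost."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    thresholds = np.linspace(0.0, 1.0, 101)
    costs = []
    for t in thresholds:
        pred = scores > t                  # label "1" when the score exceeds t
        fn = np.sum(y_true & ~pred)        # true 1s predicted as 0 (missed 1s)
        fp = np.sum(~y_true & pred)        # true 0s predicted as 1
        costs.append(cost_fn * fn + cost_fp * fp)
    costs = np.asarray(costs)
    best = np.argmin(costs)                # first threshold with the lowest cost
    return thresholds[best], costs[best]

# Synthetic stand-in for predicted probabilities (an assumption)
rng = np.random.RandomState(1)
y_demo = rng.binomial(1, 0.3, size=1000)
scores_demo = np.clip(0.25 + 0.3 * y_demo + rng.normal(0, 0.15, size=1000), 0, 1)

t_equal, _ = pick_threshold_by_cost(y_demo, scores_demo, cost_fn=1.0, cost_fp=1.0)
t_costly_fn, _ = pick_threshold_by_cost(y_demo, scores_demo, cost_fn=5.0, cost_fp=1.0)
```

Making a missed "1" more expensive can only keep the chosen threshold where it is or push it lower. For question 2, a reasonable workflow would be to pick the threshold on the training (or cross-validation) predictions and then report the confusion table at that fixed threshold on the test set, rather than re-tuning it there.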

I also looked at the official documentation for the KS test in scipy here - http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest

Related: Statistical tests with Python and Rpy2 (Kolmogorov and T-test)
