Edit: Thanks for the clarification. I have changed my answer.
It is important to realize that you are trying to project a selection made in a multi-dimensional space down to one dimension. You will not always get a separation as clean as the one you obtained. There are also several ways to do this; here I put together a simple example that can help you explain the model to your client, but it of course does not capture the full complexity of the model.
You did not provide any sample data, so I will generate some from the breast cancer dataset.
First, let's import what we need:
from sklearn import datasets
from xgboost import XGBClassifier
import pandas as pd
import numpy as np
Now load the dataset and train a very simple XGBoost model:
cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target
xgb_model = XGBClassifier(n_estimators=5,
                          objective="binary:logistic",
                          random_state=42)
xgb_model.fit(X, y)
There are several ways to approach this.
One option is to classify the probabilities that the model outputs. You decide which probability ranges you consider "high risk", "medium risk", and "low risk", and classify the bins of your data accordingly. In this example I take low as 0 <= p <= 0.5, medium as 0.5 < p <= 0.8, and high as 0.8 < p <= 1.
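Just to illustrate the thresholding rule on its own (the cut-offs 0.5 and 0.8 are the ones I chose above and are of course adjustable), a small standalone sketch could look like this:
import pandas as pd

# A few made-up probabilities, only to show how the cut-offs work
example_probs = pd.Series([0.10, 0.55, 0.85, 0.45, 0.95])
example_risk = pd.cut(example_probs, [0., 0.5, 0.8, 1.0], include_lowest=True,
                      labels=['Low Risk', 'Medium Risk', 'High Risk'])
print(example_risk)
# Expected labels: Low, Medium, High, Low, High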
First, you have to compute the predicted probability for each sample. I would suggest doing this on a test set, to avoid bias from possible overfitting of the model.
y_prob = pd.DataFrame(xgb_model.predict_proba(X))[0]
df = pd.DataFrame(X, columns=cancer.feature_names)
# Stores the probability of a malignant cancer
df['probability'] = y_prob
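If you prefer the test-set suggestion, a minimal sketch could look like the following (the names xgb_holdout and holdout_df are only for this illustration; the outputs further down were produced on the full X for simplicity):
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
xgb_holdout = XGBClassifier(n_estimators=5, objective="binary:logistic",
                            random_state=42)
xgb_holdout.fit(X_train, y_train)

# Probability of class 0 (malignant) on the held-out data
holdout_df = pd.DataFrame(X_test, columns=cancer.feature_names)
holdout_df['probability'] = pd.DataFrame(xgb_holdout.predict_proba(X_test))[0]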
Then you have to bin the data and compute the mean probability of each bin. I suggest using np.histogram_bin_edges to compute the bins for your data automatically:
def calculate_mean_prob(feat):
    """Calculates mean probability for a feature value, binning it."""
    # Bins from the automatic rules from numpy, check docs for details
    bins = np.histogram_bin_edges(df[feat], bins='auto')
    binned_values = pd.cut(df[feat], bins)
    return df['probability'].groupby(binned_values).mean()
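If you are curious about what the 'auto' rule produces for a given feature, you can inspect the edges directly, for example:
edges = np.histogram_bin_edges(df['worst radius'], bins='auto')
print(len(edges) - 1)  # number of bins chosen for this feature
print(edges[:5])       # the first few bin edges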
Now you can classify each bin according to what you consider low/medium/high probability:
def classify_probability(prob, medium=0.5, high=0.8, fillna_method='ffill'):
    """Classify the output of each bin into a risk group,
    according to the probability.
    Following these rules:
        0 <= p <= medium: Low Risk
        medium < p <= high: Medium Risk
        high < p <= 1: High Risk
    If a bin has no entries, it will be filled using fillna with the method
    specified in fillna_method.
    """
    risk = pd.cut(prob, [0., medium, high, 1.0], include_lowest=True,
                  labels=['Low Risk', 'Medium Risk', 'High Risk'])
    risk = risk.fillna(method=fillna_method)
    return risk
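Before merging anything, you can already chain the two helpers to inspect the per-bin risk of a single feature:
mean_prob = calculate_mean_prob('worst radius')
risk_per_bin = classify_probability(mean_prob)
print(risk_per_bin)
# One Low/Medium/High Risk label per 'worst radius' bin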
This returns the risk for each bin you divided the data into. Since you will probably get several consecutive bins with the same label, you may want to merge consecutive pd.Interval bins. The code could look like this:
def sum_interval(i1, i2):
    if i2 is None:
        return None
    if i1.right == i2.left:
        return pd.Interval(i1.left, i2.right)
    return None
def sum_intervals(args):
    """Given a list of pd.Intervals,
    returns a list merging consecutive intervals."""
    result = list()
    current_interval = args[0]
    for next_interval in list(args[1:]) + [None]:
        # Try to merge the current interval and the next interval.
        # The trailing None is necessary to flush the last interval.
        sum_int = sum_interval(current_interval, next_interval)
        if sum_int is not None:
            # Update the current_interval when merging is possible
            current_interval = sum_int
        else:
            # Otherwise start a new interval
            result.append(current_interval)
            current_interval = next_interval
    if len(result) == 1:
        return result[0]
    return result
def combine_bins(risk):
    # Group the bins by their risk label
    grouped = risk.groupby(risk, observed=True).apply(lambda x: sorted(list(x.index)))
    # Merge the intervals of each label, if consecutive
    merged_intervals = grouped.apply(sum_intervals)
    return merged_intervals
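To see what sum_intervals does in isolation, here is a tiny example with hand-made intervals (values chosen only for illustration):
toy = [pd.Interval(0, 1), pd.Interval(1, 2), pd.Interval(3, 4)]
print(sum_intervals(toy))
# The first two intervals share an endpoint, so they are merged:
# [Interval(0, 2, closed='right'), Interval(3, 4, closed='right')]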
Now you can combine all the functions to compute the bins of each feature:
def generate_risk_class(feature, medium=0.5, high=0.8):
    mean_prob = calculate_mean_prob(feature)
    classification = classify_probability(mean_prob, medium=medium, high=high)
    merged_bins = combine_bins(classification)
    return merged_bins
For example, generate_risk_class('worst radius') returns:
Low Risk (7.93, 17.3]
Medium Risk (17.3, 18.639]
High Risk (18.639, 36.04]
But if the feature you use is not a good discriminator (or does not separate high/low risk linearly), you will get more fragmented regions. For example, generate_risk_class('mean symmetry') returns:
Low Risk [(0.114, 0.209], (0.241, 0.249], (0.272, 0.288]]
Medium Risk [(0.209, 0.225], (0.233, 0.241], (0.249, 0.264]]
High Risk [(0.225, 0.233], (0.264, 0.272], (0.288, 0.304]]
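If you want to produce such a summary for every feature at once (for instance to hand your client a table), a simple loop over the feature names could be a starting point (an untested convenience sketch):
risk_summary = {feat: generate_risk_class(feat) for feat in cancer.feature_names}
for feat, risk_classes in risk_summary.items():
    print(feat)
    print(risk_classes)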