
Including the training data in SHAP TreeExplainer gives a different expected_value for a scikit-learn GBT Regressor.

Reproducible example (run in Google Colab):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
import shap

shap.__version__
# 0.37.0

X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gbt = GradientBoostingRegressor(random_state=0)
gbt.fit(X_train, y_train)

# mean prediction:
mean_pred_gbt = np.mean(gbt.predict(X_train))
mean_pred_gbt
# -11.534353657511172

# explainer without data
gbt_explainer = shap.TreeExplainer(gbt)
gbt_explainer.expected_value
# array([-11.53435366])

np.isclose(mean_pred_gbt, gbt_explainer.expected_value)
# array([ True])

# explainer with training data
gbt_data_explainer = shap.TreeExplainer(model=gbt, data=X_train) # specifying feature_perturbation does not change the result
gbt_data_explainer.expected_value
# -23.564797322079635

So the expected value when including the training data (gbt_data_explainer.expected_value) is quite different from the one computed without supplying data (gbt_explainer.expected_value).

Both approaches are additive and consistent when used with their respective (clearly different) shap_values:

np.abs(gbt_explainer.expected_value + gbt_explainer.shap_values(X_train).sum(1) - gbt.predict(X_train)).max() < 1e-4
# True

np.abs(gbt_data_explainer.expected_value + gbt_data_explainer.shap_values(X_train).sum(1) - gbt.predict(X_train)).max() < 1e-4
# True

But I would like to know why they do not give the same expected_value, and why gbt_data_explainer.expected_value is so different from the mean prediction.

What am I missing here?


2 Answers


Apparently, when data is passed, shap subsets it to 100 rows, then runs those rows through the trees to reset the sample counts for each node. So the reported -23.5... is the average model output over those 100 rows (a quick check is sketched below, after the links).

data gets passed to an Independent masker, which does the subsampling:
https://github.com/slundberg/shap/blob/v0.37.0/shap/explainers/_tree.py#L94
https://github.com/slundberg/shap/blob/v0.37.0/shap/explainers/_explainer.py#L68
https://github.com/slundberg/shap/blob/v0.37.0/shap/maskers/_tabular.py#L216
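
As a quick sanity check (a sketch, not part of the original answer, assuming the gbt and X_train objects from the question and shap 0.37), one can rebuild the same Independent masker and average the model output over its retained rows:

from shap.maskers import Independent

# reproduce the default subsampling (max_samples defaults to 100)
check_masker = Independent(X_train, max_samples=100)
check_masker.data.shape
# expected: (100, 10)

# average model output over the 100 subsampled rows
gbt.predict(check_masker.data).mean()
# should match gbt_data_explainer.expected_value (about -23.56)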

Running

from shap import maskers

another_gbt_explainer = shap.TreeExplainer(
    gbt,
    data=maskers.Independent(X_train, max_samples=800),
    feature_perturbation="tree_path_dependent"
)
another_gbt_explainer.expected_value

gives back

-11.534353657511172
Answered 2020-11-13T22:44:57.503

While @Ben did a great job digging out how data gets passed through the Independent masker, his answer does not show exactly (1) how the base value is calculated and where the different base value comes from, and (2) how to choose/lower the max_samples parameter.

Where the different value comes from

The masker object has a data attribute that holds the data after the masking process. To get the displayed gbt_explainer.expected_value:

from shap.maskers import Independent
gbt = GradientBoostingRegressor(random_state=0)
gbt.fit(X_train, y_train)  # X_train, y_train as in the question

# mean prediction:
mean_pred_gbt = np.mean(gbt.predict(X_train))
mean_pred_gbt
# -11.534353657511172

# explainer without data
gbt_explainer = shap.TreeExplainer(gbt)
gbt_explainer.expected_value
# array([-11.53435366])

gbt_explainer = shap.TreeExplainer(gbt, Independent(X_train,100))
gbt_explainer.expected_value
# -23.56479732207963

one needs to do:

masker = Independent(X_train,100)
gbt.predict(masker.data).mean()
# -23.56479732207963

How to choose max_samples

Setting max_samples to the length of the original dataset seems to work for other explainers too:

import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import shap
from shap.maskers import Independent
from scipy.special import logit, expit

corpus,y = shap.datasets.imdb()
corpus_train, corpus_test, y_train, y_test = train_test_split(corpus, y, test_size=0.2, random_state=7)

vectorizer = TfidfVectorizer(min_df=10)
X_train = vectorizer.fit_transform(corpus_train)

model = sklearn.linear_model.LogisticRegression(penalty="l2", C=0.1)
model.fit(X_train, y_train)

explainer = shap.Explainer(model
                           ,masker = Independent(X_train,100)
                           ,feature_names=vectorizer.get_feature_names()
                          )
explainer.expected_value
# -0.18417413671991964

This value comes from:

masker=Independent(X_train,100)
logit(model.predict_proba(masker.data.mean(0).reshape(1,-1))[...,1])
# array([-0.18417414])

With max_samples=100 the value seems a bit off from the "true" base_value (obtained by just feeding in the feature means):

logit(model.predict_proba(X_train.mean(0).reshape(1,-1))[:,1])
# array([-0.02938039])

By increasing max_samples one may get reasonably close to the "true" baseline while keeping the number of samples low:

masker = Independent(X_train,1000)
logit(model.predict_proba(masker.data.mean(0).reshape(1,-1))[:,1])
# -0.05957302658674238

So, to get the base value for an explainer of interest: (1) pass explainer.data (or masker.data) through your model, and (2) choose max_samples so that the base_value on the sampled data is close enough to the true base value. You may also try to check whether the values and order of the SHAP importances converge.
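
A minimal sketch of such a convergence check (importance_order is a hypothetical helper, assuming the gbt, X_train and X_test objects from the question): compute mean absolute SHAP values for a few max_samples settings and compare the resulting feature rankings.

import numpy as np
import shap
from shap.maskers import Independent

def importance_order(max_samples):
    # build a data-aware explainer on a background subsample of the given size
    expl = shap.TreeExplainer(gbt, Independent(X_train, max_samples=max_samples))
    sv = expl.shap_values(X_test)
    # rank features by mean |SHAP value|, largest first
    return np.argsort(np.abs(sv).mean(0))[::-1]

for n in (100, 400, 800):
    print(n, importance_order(n))  # may be slow for large backgrounds
# once the printed orders stop changing, max_samples is large enough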

Some may notice that, to get the base values, we sometimes average the feature inputs (LogisticRegression) and sometimes the outputs (GBT).
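
A short sketch of why that works for the linear model (assuming the fitted model and X_train from the TF-IDF example above, and not part of the original answer): for LogisticRegression the log-odds are linear in the features, so the decision function of the mean input equals the mean of the decision functions, whereas for a non-linear GBT the two generally differ.

import numpy as np

# For LogisticRegression, logit(predict_proba[..., 1]) is exactly the linear
# decision_function, so averaging the inputs or the outputs gives the same number:
dec_of_mean_input = model.decision_function(np.asarray(X_train.mean(0)))
mean_of_outputs = model.decision_function(X_train).mean()
np.isclose(dec_of_mean_input, mean_of_outputs)
# expected: array([ True]); a GBT is non-linear, so no such identity holds there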

Answered 2020-11-21T13:46:07.557