python - SHAP: XGBoost and LightGBM difference in shap_values calculation

Question

I have this code in Visual Studio Code:

import pandas as pd
import numpy as np
import shap
import matplotlib.pyplot as plt
import xgboost as xgb 
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate, cross_val_score
from sklearn.metrics import classification_report, ConfusionMatrixDisplay, accuracy_score

df = pd.read_csv("./mydataset.csv")
target=df.pop('target')
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2, random_state=22)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=22)

xgb_model = xgb.XGBClassifier(eval_metric='mlogloss',use_label_encoder =False)
xgb_fitted = xgb_model.fit(X_train, y_train)

explainer = shap.TreeExplainer(xgb_fitted)
shap_values = explainer.shap_values(X_test)

shap.summary_plot(shap_values[1], X_test)
shap.summary_plot(shap_values[1], X_test, plot_type="bar")

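When I run this code, I get this error:

Summary plots need a matrix of shap_values, not a vector.

on the shap.summary_plot line.

What is the problem, and how can I fix it?

The code above is based on this code sample: https://github.com/slundberg/shap.

The dataset looks like this: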

Cat1,Cat2,Age,Cat3,Cat4,target
0,0,18,1,0,1
0,0,17,1,0,1
0,0,15,1,1,1
0,0,15,1,0,1
0,0,16,1,0,1
0,1,16,1,1,1
0,1,16,1,1,1
0,0,17,1,0,1
0,1,15,1,1,1
0,1,15,1,0,1
0,0,15,1,0,1
0,0,15,1,0,1
0,1,15,1,1,1
0,1,15,1,0,1
0,1,15,1,0,1
0,0,16,1,0,1
0,0,16,1,0,1
0,0,16,1,0,1
0,1,17,1,0,0
0,1,16,1,1,1
0,1,15,1,0,1
0,1,15,1,0,1
0,1,16,1,1,1
0,1,16,1,1,1
0,0,15,0,0,1
0,0,16,1,0,1
0,1,15,1,0,1

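Note that the real data has 700 rows; I copied only a small part of it to show what the data looks like.

Edit 1

The main reason for this question is to understand how the code should change when a different classifier is used.

I originally had sample code that used lgbm, but when I changed it to xgboost, it produced an error on the summary plot.

To illustrate what I mean, I wrote the following sample code: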

import pandas as pd
import shap
import lightgbm as lgb
import xgboost as xgb 
from sklearn.model_selection import train_test_split

df = pd.read_csv("./mydataset.csv")
target=df.pop('target')
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2, random_state=22)

# select one of the two models
model = xgb.XGBClassifier()
#model = lgb.LGBMClassifier()
model_fitted = model.fit(X_train, y_train)

explainer = shap.Explainer(model_fitted)
shap_values = explainer.shap_values(X_test)

shap.summary_plot(shap_values[1], X_test)
shap.summary_plot(shap_values[1], X_test, plot_type="bar")

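If I use the LGBM model, it runs fine; if I use XGBoost, it fails. What is the difference, and how should I change the code so that XGBoost behaves like LGBM and the application works?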

score 0 · Accepted Answer

Assuming you have copied the data from the question above, you can do the following:

import pandas as pd
import numpy as np
import shap
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.model_selection import (
    train_test_split,
    StratifiedKFold,
    cross_validate,
    cross_val_score,
)
from sklearn.metrics import (
    classification_report,
    ConfusionMatrixDisplay,
    accuracy_score,
)

df = pd.read_clipboard(sep=",")

target = df.pop("target")
X_train, X_test, y_train, y_test = train_test_split(
    df, target, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)


xgb_model = xgb.XGBClassifier(eval_metric="mlogloss", use_label_encoder=False)
xgb_fitted = xgb_model.fit(X_train, y_train)

explainer = shap.TreeExplainer(xgb_fitted)
shap_values = explainer.shap_values(X_test)

shap.summary_plot(shap_values, X_test)
# shap.summary_plot(shap_values, X_test, plot_type="bar")

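The code you pasted assumes there are 2 ["identical"] arrays of SHAP values, one for each of the classes "0" and "1". Since that example was published, something has changed in the way explainer.shap_values calculates SHAP values for XGBoost. So, for now, it is enough to pass shap_values (without a class index).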
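A minimal sketch of how the plotting call could be made tolerant of both return shapes discussed here. The helper name positive_class_matrix is hypothetical and not part of shap; it assumes a binary classifier whose SHAP values arrive either as a list of per-class arrays (the LightGBM case in the question) or as a single 2-D array (the XGBoost case). The arrays below are stand-ins, not real SHAP output.

```python
import numpy as np


def positive_class_matrix(shap_values):
    """Return a 2-D (n_samples, n_features) array for the positive class.

    Hypothetical helper: accepts either a list of per-class arrays
    or a single 2-D array, and rejects anything else.
    """
    if isinstance(shap_values, list):
        # Per-class list: take the array that belongs to class 1.
        return np.asarray(shap_values[1])
    arr = np.asarray(shap_values)
    if arr.ndim != 2:
        raise ValueError(f"expected a 2-D matrix, got shape {arr.shape}")
    return arr


# Stand-in data with 3 samples and 5 features.
xgb_like = np.arange(15, dtype=float).reshape(3, 5)   # single matrix
lgbm_like = [-xgb_like, xgb_like]                      # list of two matrices

print(positive_class_matrix(xgb_like).shape)
print(positive_class_matrix(lgbm_like).shape)
```

Under these assumptions, shap.summary_plot(positive_class_matrix(shap_values), X_test) would receive a matrix for either model.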

score 0 · Accepted Answer

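Note that with summary_plot() you want to visualize which features are generally more important to the model, so it requires a matrix:

For single-output explanations this is a matrix of SHAP values (# samples x # features).

The result of shap_values = explainer.shap_values(X_test) is a matrix of shape (n_samples, 5), where 5 is the number of columns in the sample data.

When you index a single sample, for example shap_values[0], you get a vector that explains the feature contributions to that one prediction. That is why "Summary plots need a matrix of shap_values, not a vector." is raised.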
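The indexing difference can be seen with plain NumPy arrays. This is only an illustration with zero-filled stand-ins of the shapes described above (14 samples, 5 features); no model is trained.

```python
import numpy as np

n_samples, n_features = 14, 5

# XGBoost-style result here: one matrix of SHAP values.
matrix = np.zeros((n_samples, n_features))

# LightGBM-style result here: a list with one matrix per class.
per_class = [matrix.copy(), matrix.copy()]

row = matrix[1]        # indexing a matrix selects one sample (a vector)
cls = per_class[1]     # indexing the list selects one class (a matrix)

print(row.shape)  # (5,)
print(cls.shape)  # (14, 5)
```

The same expression, shap_values[1], therefore means "second sample" in one case and "class 1" in the other.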

If you want to visualize a single prediction, shap_values[0], you can use force_plot:

shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[0])

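Edit

The difference between the outputs of the two models lies in how the result out is computed. Looking at the source code for lightgbm, after the variable phi is calculated, it concatenates the values in the following way: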

phi = np.concatenate((0-phi, phi), axis=-1)

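This produces an array of shape (n_samples, (n_features + 1) * 2), because phi carries one extra column for the expected value.

This shape differs from what X_test implies, i.e. phi.shape[1] != X.shape[1] + 1, so it is reshaped into a 3-dimensional array: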

phi = phi.reshape(X.shape[0], phi.shape[1]//(X.shape[1]+1), X.shape[1]+1)

Finally, the output is a list of length 2:

out = [phi[:, i, :-1] for i in range(phi.shape[1])]
out
>>>
[array([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        ...
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]]),
 array([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        ...  
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]])]
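The three steps above (concatenate, reshape, split) can be traced on a small synthetic array. This sketch uses random numbers in place of real pred_contrib output, so only the shapes and the sign relationship are meaningful; the sizes (4 samples, 5 features) are arbitrary.

```python
import numpy as np

n_samples, n_features = 4, 5
rng = np.random.default_rng(0)

# Stand-in for pred_contrib output: one column per feature
# plus a final column holding the expected value.
phi = rng.normal(size=(n_samples, n_features + 1))

# Step 1: mirror the contributions for the binary objective.
stacked = np.concatenate((0 - phi, phi), axis=-1)

# Step 2: reshape to (n_samples, n_classes, n_features + 1).
cube = stacked.reshape(
    n_samples, stacked.shape[1] // (n_features + 1), n_features + 1
)

# Step 3: split into per-class expected values and SHAP matrices.
expected_value = [cube[0, i, -1] for i in range(cube.shape[1])]
out = [cube[:, i, :-1] for i in range(cube.shape[1])]

print(stacked.shape)                   # (4, 12)
print(cube.shape)                      # (4, 2, 6)
print(len(out), out[0].shape)          # 2 (4, 5)
print(np.array_equal(out[0], -out[1])) # True
```

The two list entries are mirror images of each other, which matches the expected_value pair of opposite sign shown in the LightGBM example.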

See the examples below to understand how out is computed differently.

LightGBM example

import pandas as pd
import numpy as np
import shap
import lightgbm as lgb
import xgboost as xgb 
import shap.explainers as explainers
from sklearn.model_selection import train_test_split

df = pd.read_csv("test_data.csv")
target=df.pop('target')

X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.5, random_state=0)

model = lgb.LGBMClassifier()
model_fitted = model.fit(X_train, y_train)
explainer = shap.TreeExplainer(model_fitted)

# Calculate phi from https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L347
tree_limit = -1 if explainer.model.tree_limit is None else explainer.model.tree_limit
phi = explainer.model.original_model.predict(X_test, num_iteration=tree_limit, pred_contrib=True)

# Objective is binary: https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L349
if explainer.model.original_model.params['objective'] == 'binary':
    phi = np.concatenate((0-phi, phi), axis=-1)

# Phi shape is different from X_test:
if phi.shape[1] != X_test.shape[1] + 1:
    phi = phi.reshape(X_test.shape[0], phi.shape[1]//(X_test.shape[1]+1), X_test.shape[1]+1)

# Return out: https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L370
expected_value = [phi[0, i, -1] for i in range(phi.shape[1])]
out = [phi[:, i, :-1] for i in range(phi.shape[1])]
expected_value
>>> [-0.8109302162163288, 0.8109302162163288]
out
>>> 
[array([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]]),
 array([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]])]

XGBoost example

import pandas as pd
import numpy as np
import shap
import lightgbm as lgb
import xgboost as xgb 
import shap.explainers as explainers
from sklearn.model_selection import train_test_split

df = pd.read_csv("test_data.csv")
target=df.pop('target')

X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.5, random_state=0)

model = xgb.XGBClassifier()
model_fitted = model.fit(X_train, y_train)
explainer = shap.TreeExplainer(model_fitted)

# Transform data to DMatrix: https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L326
if not isinstance(X_test, xgb.core.DMatrix):
    X_test = xgb.DMatrix(X_test)

tree_limit = explainer.model.tree_limit

# Calculate phi: https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L331
phi = explainer.model.original_model.predict(
    X_test, ntree_limit=tree_limit, pred_contribs=True,
    approx_contribs=False, validate_features=False
)

# Model output is "raw": https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L339
model_output_vals = explainer.model.original_model.predict(
    X_test, ntree_limit=tree_limit, output_margin=True,
    validate_features=False
)
model_output_vals
>>> array([-0.11323176, -0.11323176,  0.5436669 ,  0.87637275,  1.5332711 ,
       -0.11323176,  1.5332711 ,  0.5436669 ,  1.5332711 ,  0.5436669 ,
        0.87637275,  0.87637275, -0.11323176,  0.5436669 ], dtype=float32)

# Return out: https://github.com/slundberg/shap/blob/46b3800b31df04745416da27c71b216f91d61775/shap/explainers/_tree.py#L374
expected_value_ = phi[0, -1]
expected_value_
>>> 0.817982
out_ = phi[:, :-1]
out_
>>>
array([[ 0.        , -0.35038763, -0.5808259 ,  0.        ,  0.        ],
       [ 0.        , -0.35038763, -0.5808259 ,  0.        ,  0.        ],
       [ 0.        ,  0.3065111 , -0.5808259 ,  0.        ,  0.        ],
       [ 0.        , -0.35038763,  0.4087782 ,  0.        ,  0.        ],
       [ 0.        ,  0.3065111 ,  0.4087782 ,  0.        ,  0.        ],
       [ 0.        , -0.35038763, -0.5808259 ,  0.        ,  0.        ],
       [ 0.        ,  0.3065111 ,  0.4087782 ,  0.        ,  0.        ],
       [ 0.        ,  0.3065111 , -0.5808259 ,  0.        ,  0.        ],
       [ 0.        ,  0.3065111 ,  0.4087782 ,  0.        ,  0.        ],
       [ 0.        ,  0.3065111 , -0.5808259 ,  0.        ,  0.        ],
       [ 0.        , -0.35038763,  0.4087782 ,  0.        ,  0.        ],
       [ 0.        , -0.35038763,  0.4087782 ,  0.        ,  0.        ],
       [ 0.        , -0.35038763, -0.5808259 ,  0.        ,  0.        ],
       [ 0.        ,  0.3065111 , -0.5808259 ,  0.        ,  0.        ]],
      dtype=float32)
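As a sanity check on the XGBoost numbers printed above, each row of out_ plus expected_value_ should reproduce the corresponding raw margin in model_output_vals. The sketch below copies four of the printed rows (indices 0, 2, 3 and 4) by hand rather than recomputing them, and then builds a LightGBM-style two-element list from the single matrix, which is an assumption about how one might mimic that layout, not something shap does for XGBoost.

```python
import numpy as np

# Values copied from the printed output above.
expected_value = 0.817982
out_rows = np.array([
    [0.0, -0.35038763, -0.5808259, 0.0, 0.0],   # row 0
    [0.0,  0.3065111,  -0.5808259, 0.0, 0.0],   # row 2
    [0.0, -0.35038763,  0.4087782, 0.0, 0.0],   # row 3
    [0.0,  0.3065111,   0.4087782, 0.0, 0.0],   # row 4
])
margins = np.array([-0.11323176, 0.5436669, 0.87637275, 1.5332711])

# Additivity: sum of feature contributions + expected value = raw margin.
reconstructed = out_rows.sum(axis=1) + expected_value
print(np.allclose(reconstructed, margins, atol=1e-5))

# Mimic the LightGBM layout: negated matrix for class 0, original for class 1.
lgbm_style = [-out_rows, out_rows]
print(len(lgbm_style), lgbm_style[1].shape)
```

With a list built this way, the original shap_values[1] indexing would select a matrix again.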
