
I have fitted an XGBoost model for binary classification. I am trying to understand the fitted model and to use SHAP to explain its predictions.

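However, I am confused by the force plot that SHAP generates. I expected the output value to be below 0, because the predicted probability is below 0.5. Instead, the plot shows an output value of 8.12.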

Below is the code I used to generate the results.

import shap
import xgboost as xgb
import json
from scipy.sparse import load_npz

print('Version of SHAP: {}'.format(shap.__version__))
print('Version of XGBoost: {}'.format(xgb.__version__))

Version of SHAP: 0.39.0

Version of XGBoost: 1.4.1

# Read the data
X = load_npz('test_data.npz')
X_dmatrix = xgb.DMatrix(X)

# Read the selected features
with open('feature_list.json', 'r') as file:
    feature_list = json.load(file)
    
feature_names = [f'Feature {x:04d}' for x in range(len(feature_list))]

# Read the XGBoost model
xgb_model = xgb.Booster()
xgb_model.load_model('xgboost.json')

# Model prediction

model_pred_detail = xgb_model.predict(X_dmatrix, pred_contribs=True)
model_pred_prob = xgb_model.predict(X_dmatrix)
model_pred_detail.shape

(7887, 501)

# Randomly select a case
xid=4549
print('Predict proba: {:.04f}'.format(model_pred_prob[xid]))

Predict proba: 0.2292

# Doing SHAP way (https://github.com/slundberg/shap)
explainer = shap.Explainer(xgb_model, feature_names=feature_names, algorithm='tree')
shap_values = explainer(X.toarray())

shap.plots.force(shap_values[xid])

[image: SHAP force plot for the selected case]

However, if I use the SHAP values computed by the XGBoost library itself, I get a different plot, which looks much closer to what I expected.

shap.force_plot(
    model_pred_detail[xid, -1], # From XGBoost.Booster.predict with pred_contribs=True
    model_pred_detail[xid, 0:-1], # From XGBoost.Booster.predict with pred_contribs=True
    feature_names=feature_names, 
    features=X[xid].toarray()
)

[image: force plot built from XGBoost's pred_contribs values]

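Why does this happen? Which of the two gives the correct SHAP values for explaining the XGBoost model?

Thank you for your help.

Follow-up to @sergey-bushmanov's reply

Since I cannot share my own data, I reproduced the situation with an open dataset from Kaggle.

Below is the code for training the model: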


import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer
import xgboost as xgb
from sklearn.model_selection import train_test_split
import matplotlib.pylab as plt
from matplotlib import pyplot
import io
from scipy.sparse import save_npz


# parameter setting
class_weight = 10
minNgramLength = 1
maxNgramLength = 3
binary = False
min_df = 20

# Convert to fix the problem of encoding
with open('Corona_NLP_train.csv', 'rb') as file:
    csv_file = file.read()
csv_file2 = csv_file.decode('utf-8', errors='replace')

# Read and split data
df_note = pd.read_csv(io.StringIO(csv_file2), encoding='utf-8')
df_note['label'] = np.where(df_note['Sentiment'].str.contains('negative', flags=re.I), 0, 1)

df_train, df_test = train_test_split(df_note, test_size=0.2, random_state=42)

# Tokenization
vectorizer = CountVectorizer(max_df=0.98,
                             min_df=min_df,
                             binary=binary,
                             ngram_range=(minNgramLength, maxNgramLength))
vectorizer.fit(df_train['OriginalTweet'])
X_train = vectorizer.transform(df_train['OriginalTweet']).astype(float)
y_train = df_train['label'].astype(float).reset_index(drop=True)

last_params ={
 'lambda': 0.00016096144192346114,
 'alpha': 0.057770973181367063,
 'eta': 0.19258319097144733,
 'gamma': 0.40032424821976653,
 'max_depth': 9,
 'min_child_weight': 5,
 'subsample': 0.31304772813494836,
 'colsample_bytree': 0.4214452441229869,
 'objective': 'binary:logistic',
 'verbosity': 0,
 'n_estimators': 400
}

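# NOTE: w_train is not defined in the original snippet. The line below is an
# assumption that weights the positive class by class_weight.
w_train = np.where(y_train == 1, class_weight, 1.0)

classifierCV = xgb.XGBClassifier(**last_params, importance_type='gain')
classifierCV.fit(X_train, y_train, sample_weight=w_train)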

# Get the features (on scikit-learn >= 1.0, use get_feature_names_out();
# get_feature_names() was removed in 1.2)
feature_names = vectorizer.get_feature_names()

# save model
classifierCV.get_booster().save_model('xgboost.json')

# Save features
import json

with open('feature_list.json', 'w') as file:
    file.write(json.dumps({y:x for x, y in enumerate(feature_names)}))

# save data
save_npz('test_data.npz', X_train)

The problem still persists with this model.


1 Answer


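Which of the two gives the correct SHAP values for explaining the XGBoost model?

Let's assume you have a binary classification problem at hand. Then what you get in the second example is indeed the correct decomposition into raw (log-odds) SHAP values: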

In [1]: from scipy.special import expit
In [2]: expit(-1.21)
Out[2]: 0.22970105095339813 

Note that 0.2297 is close to what you see in:

Predict proba: 0.2292
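To make the comparison concrete, here is a minimal sketch using only the standard library. It assumes that -1.21 is the raw (log-odds) output read off the second force plot and that 8.12 is the output of the first one; `sigmoid` and `logit` are illustrative helper names, not part of SHAP or XGBoost.

```python
import math

def sigmoid(margin):
    """Map a log-odds margin to a probability (same function as scipy.special.expit)."""
    return 1.0 / (1.0 + math.exp(-margin))

def logit(prob):
    """Inverse of sigmoid: map a probability back to log-odds."""
    return math.log(prob / (1.0 - prob))

# Output of the second force plot (values from XGBoost's pred_contribs).
print(round(sigmoid(-1.21), 4))   # 0.2297
# Output of the first force plot.
print(round(sigmoid(8.12), 4))    # 0.9997
# Predicted probability reported by xgb_model.predict, mapped to log-odds.
print(round(logit(0.2292), 4))    # -1.2128
```

A margin of 8.12 would correspond to a probability close to 1, which cannot match a prediction of 0.2292, whereas -1.21 agrees with it up to the rounding shown in the plot.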

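As for:

Why does this happen?

Most likely you have a typo somewhere, but to be sure you would need to provide a fully reproducible example, including your data, because code-wise both ways of computing SHAP values are correct.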
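One way to narrow this down is an additivity check: for each row, the feature contributions plus the bias term should sum to the model's raw margin. The sketch below uses plain Python with made-up numbers; `contributions_match_margin` is a hypothetical helper, not a SHAP or XGBoost function. In the question's setting, the inputs could come from `xgb_model.predict(X_dmatrix, pred_contribs=True)` and `xgb_model.predict(X_dmatrix, output_margin=True)`.

```python
import math

def contributions_match_margin(contrib_row, margin, tol=1e-4):
    """Return True if feature contributions plus bias add up to the raw margin.

    contrib_row: per-feature contributions followed by the bias term as the
                 last element (the layout of predict(..., pred_contribs=True)).
    margin:      raw log-odds output of the model for the same row.
    """
    return math.isclose(sum(contrib_row), margin, abs_tol=tol)

# Made-up numbers for illustration only (three features plus a bias term).
contrib_row = [-0.80, 0.15, -0.36, -0.20]

print(contributions_match_margin(contrib_row, -1.21))  # True
print(contributions_match_margin(contrib_row, 8.12))   # False
```

If the check passes for XGBoost's own contributions but the SHAP explainer's base value plus summed values gives a different number for the same row, the two calls are probably not explaining the same input. One possible cause, offered as an unverified assumption: the question builds the `DMatrix` from a sparse matrix but passes `X.toarray()` to the SHAP explainer, and XGBoost's tree booster treats entries absent from a sparse matrix as missing rather than as zeros.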

Answered 2021-11-12T15:03:05.423