I have fitted an XGBoost model for binary classification and am trying to understand the fitted model by explaining its predictions with SHAP. However, I am confused by the force plot that SHAP generates. I expected the output value to be less than 0, since the predicted probability is less than 0.5, but the SHAP plot shows a value of 8.12.
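My expectation comes from the usual log-odds/probability relationship: for a binary:logistic model, a predicted probability below 0.5 corresponds to a negative raw margin. A minimal sketch of that arithmetic, using the predicted probability 0.2292 shown further down:

import numpy as np

def sigmoid(margin):
    # Map a raw log-odds margin back to a probability
    return 1.0 / (1.0 + np.exp(-margin))

# logit(0.2292) is about -1.21, so I expected the force plot's output value to be negative
margin = np.log(0.2292 / (1 - 0.2292))
print(margin)           # approx. -1.21
print(sigmoid(margin))  # approx. 0.2292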
Here is the code I used to generate the results.
import shap
import xgboost as xgb
import json
from scipy.sparse import load_npz
print('Version of SHAP: {}'.format(shap.__version__))
print('Version of XGBoost: {}'.format(xgb.__version__))
Version of SHAP: 0.39.0
Version of XGBoost: 1.4.1
# Read the data
X = load_npz('test_data.npz')
X_dmatrix = xgb.DMatrix(X)
# Read the selected features
with open('feature_list.json', 'r') as file:
    feature_list = json.load(file)
feature_names = [f'Feature {x:04d}' for x in range(len(feature_list))]
# Read the XGBoost model
xgb_model = xgb.Booster()
xgb_model.load_model('xgboost.json')
# Model prediction
model_pred_detail = xgb_model.predict(X_dmatrix, pred_contribs=True)
model_pred_prob = xgb_model.predict(X_dmatrix)
model_pred_detail.shape
(7887, 501)
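As a sanity check (a sketch reusing the objects above): the 501 columns returned by pred_contribs=True are the 500 per-feature contributions plus a final bias column, and each row sums to the raw log-odds margin of the prediction, not to the probability.

import numpy as np
# Each row of contributions (features + bias column) sums to the raw margin;
# the sigmoid of that margin should recover the predicted probabilities
margin = model_pred_detail.sum(axis=1)
print(np.allclose(1.0 / (1.0 + np.exp(-margin)), model_pred_prob, atol=1e-5))  # expect: True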
# Random select a case
xid=4549
print('Predict proba: {:.04f}'.format(model_pred_prob[xid]))
Predict proba: 0.2292
# Doing SHAP way (https://github.com/slundberg/shap)
explainer = shap.Explainer(xgb_model, feature_names=feature_names, algorithm='tree')
shap_values = explainer(X.toarray())
shap.plots.force(shap_values[xid])
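To see the numbers behind this force plot, the Explanation object can be inspected directly (a sketch, assuming the shap 0.39 Explanation API):

# Base value and per-feature contributions behind the force plot
print(shap_values[xid].base_values)                                  # starting point of the plot
print(shap_values[xid].values.sum() + shap_values[xid].base_values)  # the plotted output value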
However, if I use the SHAP values produced by the XGBoost library instead, I get a different plot, which looks like what I expected.
shap.force_plot(
    model_pred_detail[xid, -1],    # from xgb.Booster.predict with pred_contribs=True
    model_pred_detail[xid, 0:-1],  # from xgb.Booster.predict with pred_contribs=True
    feature_names=feature_names,
    features=X[xid].toarray()
)
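A quick cross-check between the two explanations (a sketch reusing the objects defined above): if both are SHAP values of the same model in the same units, their base values and summed contributions should agree for the selected row.

# Compare the bias/base value and the total contribution for the selected case
print(model_pred_detail[xid, -1], shap_values[xid].base_values)
print(model_pred_detail[xid, 0:-1].sum(), shap_values[xid].values.sum())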
Why does this happen? Which one is the correct set of SHAP values for explaining the XGBoost model?
Thank you for your help.
Follow-up to the reply from @sergey-bushmanov:
Since I cannot share my own data, I reproduced the situation with an open dataset from Kaggle.
Here is the code for model training:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer
import xgboost as xgb
from sklearn.model_selection import train_test_split
import matplotlib.pylab as plt
from matplotlib import pyplot
import io
from scipy.sparse import save_npz
# parameter setting
class_weight = 10
minNgramLength = 1
maxNgramLength = 3
binary = False
min_df = 20
# Convert to fix the problem of encoding
with open('Corona_NLP_train.csv', 'rb') as file:
    csv_file = file.read()
csv_file2 = csv_file.decode('utf-8', errors='replace')
# Read and split data
df_note = pd.read_csv(io.StringIO(csv_file2), encoding='utf-8')
df_note['label'] = np.where(df_note['Sentiment'].str.contains('negative', flags=re.I), 0, 1)
df_train, df_test = train_test_split(df_note, test_size=0.2, random_state=42)
# Tokenization
vectorizer = CountVectorizer(max_df=0.98,
                             min_df=min_df,
                             binary=binary,
                             ngram_range=(minNgramLength, maxNgramLength))
vectorizer.fit(df_train['OriginalTweet'])
X_train = vectorizer.transform(df_train['OriginalTweet']).astype(float)
y_train = df_train['label'].astype(float).reset_index(drop=True)
last_params = {
    'lambda': 0.00016096144192346114,
    'alpha': 0.057770973181367063,
    'eta': 0.19258319097144733,
    'gamma': 0.40032424821976653,
    'max_depth': 9,
    'min_child_weight': 5,
    'subsample': 0.31304772813494836,
    'colsample_bytree': 0.4214452441229869,
    'objective': 'binary:logistic',
    'verbosity': 0,
    'n_estimators': 400
}
classifierCV = xgb.XGBClassifier(**last_params, importance_type='gain')
# w_train was undefined in the original snippet; here it is assumed to apply the
# class_weight defined above to the negative (label 0) class
w_train = np.where(y_train == 0, class_weight, 1.0)
classifierCV.fit(X_train, y_train, sample_weight=w_train)
# Get the features
feature_names = vectorizer.get_feature_names()
# save model
classifierCV.get_booster().save_model('xgboost.json')
# Save features
import json
with open('feature_list.json', 'w') as file:
    file.write(json.dumps({y: x for x, y in enumerate(feature_names)}))
# save data
save_npz('test_data.npz', X_train)
The problem persists with this model: the files saved here (xgboost.json, feature_list.json, test_data.npz) are the ones loaded by the snippet at the top, so the same comparison can be rerun unchanged.