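I am building a GBM to predict an event with a very low likelihood. My model performs no better with my real features than with the random-number features I added (i.e. it is awful), so I am trying to use SMOTE to overcome the class imbalance in my target (98.55% 0, 1.45% 1).
The solution here seems to suggest that my problem comes from the input not being an array, but my code suggests that it is.
My data looks like this: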
X = num_df.drop(columns=[u'Has Claim'])
y = num_df[u'Has Claim']
X
   Underwriting Year  Public Liability Limit  Employers Liability Limit  \
0               2014                 1000000                          0
1               2014                 5000000                          0
2               2014                 5000000                   10000000
3               2014                 2000000                          0
4               2014                 1000000                          0

   Tools Sum Insured  Professional Indemnity Limit  \
0                0.0                         50000
1                0.0                             0
2             4000.0                             0
3             2000.0                             0
4                0.0                       1000000

   Contract Works Sum Insured  Hired in Plan Sum Insured  Manual EE  \
0                           0                          0          1
1                           0                          0          1
2                           0                          0          1
3                           0                          0          6
4                           0                          0          1

   Clerical EE  Subcontractor EE  rand_1  rand_2  rand_3  rand_4  rand_5  \
0            0                 0       1       2       2       1       5
1            0                 0       4       3       1       2       2
2            7                 0       2       2       4       1       5
3            4                 0       5       4       1       2       2
4            0                 0       4       3       4       5       2

   rand_6  rand_7  rand_8  rand_9  rand_10
0       2       3       5       1        1
1       4       3       1       1        5
2       2       5       3       1        5
3       1       5       1       3        2
4       5       2       5       4        3
y
0 0
1 0
2 0
3 0
4 0
Name: Has Claim, dtype: int64
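As a quick check of the class balance quoted at the top, the proportions can be reproduced along these lines. This is only a minimal sketch on made-up labels: `y_demo` is hypothetical (29 positives in 2,000 rows, chosen to match 1.45%) and is not my real data.

```python
import pandas as pd

# Made-up labels with the quoted proportions: 29 positives in 2000 rows.
y_demo = pd.Series([1] * 29 + [0] * 1971, name="Has Claim")

# Share of each class; normalize=True returns fractions rather than counts.
class_share = y_demo.value_counts(normalize=True)

# Negatives per positive, a rough measure of how lopsided the target is.
imbalance_ratio = (y_demo == 0).sum() / (y_demo == 1).sum()

print(class_share)
print(imbalance_ratio)
```

On the real data the same two lines would be run on `y` instead of `y_demo`.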
I do a train/test split:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=42)
When I fit my model, it works:
model.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=0.5, gamma=0, learning_rate=0.1, max_delta_step=0,
max_depth=5, min_child_weight=1, missing=None, n_estimators=1000,
n_jobs=1, nthread=4, objective='binary:logistic', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=42, silent=True,
subsample=0.8)
However, if I use
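from imblearn.over_sampling import SMOTE

smt = SMOTE()
# fit_sample is the older imbalanced-learn name for this call (fit_resample in newer releases)
X_train, y_train = smt.fit_sample(X_train, y_train)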
then refit my model and use
y_pred = model.predict(X_test)
then I get
ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19'] [u'Underwriting Year', u'Public Liability Limit', u'Employers Liability Limit', u'Tools Sum Insured', u'Professional Indemnity Limit', u'Contract Works Sum Insured', u'Hired in Plan Sum Insured', u'Manual EE', u'Clerical EE', u'Subcontractor EE', u'rand_1', u'rand_2', u'rand_3', u'rand_4', u'rand_5', u'rand_6', u'rand_7', u'rand_8', u'rand_9', u'rand_10']
expected f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f18, f19, f12, f13, f10, f11, f16, f17, f14, f15 in input data
training data did not have the following fields: rand_6, rand_7, rand_4, rand_5, rand_2, rand_3, rand_1, Public Liability Limit, Subcontractor EE, Professional Indemnity Limit, rand_8, rand_9, Manual EE, Employers Liability Limit, rand_10, Contract Works Sum Insured, Underwriting Year, Tools Sum Insured, Clerical EE, Hired in Plan Sum Insured
I would like to be able to predict with my updated model.
Am I misunderstanding how SMOTE works? Am I applying it incorrectly?
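My current guess at what is going on, as a sketch under stated assumptions: the 'f0', 'f1', ... names in the error look like the placeholder names a model falls back to when it is fit on a bare NumPy array. That would fit if the resampling step returns arrays rather than DataFrames, so the refit model has no column names while `X_test` still does. The toy frames below are hypothetical, and `np.asarray` only stands in for the resampling step. Neither SMOTE nor XGBoost is actually called here.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real data: two of the feature columns, made-up values.
X_train = pd.DataFrame({"Manual EE": [1, 1, 6, 1], "Clerical EE": [0, 0, 4, 0]})
X_test = pd.DataFrame({"Manual EE": [1, 2], "Clerical EE": [7, 0]})

# Assumption: the resampler hands back a bare NumPy array.
# np.asarray stands in for that step in this sketch.
X_resampled = np.asarray(X_train)

# A bare array carries no column names, so a model fit on it can only number them.
has_names = hasattr(X_resampled, "columns")

# Option 1: put the original column names back on the resampled training data.
X_resampled_named = pd.DataFrame(X_resampled, columns=X_train.columns)

# Option 2: drop the names from the test data so both sides are unnamed.
X_test_unnamed = X_test.to_numpy()

print(has_names)
print(list(X_resampled_named.columns) == list(X_test.columns))
print(X_test_unnamed.shape)
```

If this guess is right, either refitting on `X_resampled_named` or calling `model.predict` on `X_test_unnamed` should make the training and prediction inputs agree. The column order has to be the same in both cases.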