
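I am building a GBM to predict something with a very low likelihood, and my model performs on par with the random-number features I included (i.e. it is bad), so I am trying to use SMOTE to overcome the dominance of one outcome class (98.55% 0, 1.45% 1).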

The solution here seems to suggest that my problem comes from the input not being an array type, but my code suggests that it is.

My data looks like this:

X = num_df.drop(columns=[u'Has Claim'])
y = num_df[u'Has Claim']

X
   Underwriting Year  Public Liability Limit  Employers Liability Limit  \
0               2014                 1000000                          0   
1               2014                 5000000                          0   
2               2014                 5000000                   10000000   
3               2014                 2000000                          0   
4               2014                 1000000                          0   

   Tools Sum Insured  Professional Indemnity Limit  \
0                0.0                         50000   
1                0.0                             0   
2             4000.0                             0   
3             2000.0                             0   
4                0.0                       1000000   

   Contract Works Sum Insured  Hired in Plan Sum Insured  Manual EE  \
0                           0                          0          1   
1                           0                          0          1   
2                           0                          0          1   
3                           0                          0          6   
4                           0                          0          1   

   Clerical EE  Subcontractor EE  rand_1  rand_2  rand_3  rand_4  rand_5  \
0            0                 0       1       2       2       1       5   
1            0                 0       4       3       1       2       2   
2            7                 0       2       2       4       1       5   
3            4                 0       5       4       1       2       2   
4            0                 0       4       3       4       5       2   

   rand_6  rand_7  rand_8  rand_9  rand_10  
0       2       3       5       1        1  
1       4       3       1       1        5  
2       2       5       3       1        5  
3       1       5       1       3        2  
4       5       2       5       4        3  

y
0    0
1    0
2    0
3    0
4    0
Name: Has Claim, dtype: int64

I do a train/test split:

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2, 
                                                    random_state=42)

When I fit my model, it works:

model.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
   colsample_bytree=0.5, gamma=0, learning_rate=0.1, max_delta_step=0,
   max_depth=5, min_child_weight=1, missing=None, n_estimators=1000,
   n_jobs=1, nthread=4, objective='binary:logistic', random_state=0,
   reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=42, silent=True,
   subsample=0.8)

However, if I use

smt = SMOTE()
X_train, y_train = smt.fit_sample(X_train,
                                  y_train)

then refit my model and use

y_pred = model.predict(X_test)

then I get

ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19'] [u'Underwriting Year', u'Public Liability Limit', u'Employers Liability Limit', u'Tools Sum Insured', u'Professional Indemnity Limit', u'Contract Works Sum Insured', u'Hired in Plan Sum Insured', u'Manual EE', u'Clerical EE', u'Subcontractor EE', u'rand_1', u'rand_2', u'rand_3', u'rand_4', u'rand_5', u'rand_6', u'rand_7', u'rand_8', u'rand_9', u'rand_10']
expected f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f18, f19, f12, f13, f10, f11, f16, f17, f14, f15 in input data
training data did not have the following fields: rand_6, rand_7, rand_4, rand_5, rand_2, rand_3, rand_1, Public Liability Limit, Subcontractor EE, Professional Indemnity Limit, rand_8, rand_9, Manual EE, Employers Liability Limit, rand_10, Contract Works Sum Insured, Underwriting Year, Tools Sum Insured, Clerical EE, Hired in Plan Sum Insured

I would like to be able to predict with my updated model.

Have I misunderstood how SMOTE works? Am I applying it incorrectly?
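One possible reading of the traceback: the generic names f0 ... f19 suggest the model was refit on a plain NumPy array (which is what older imbalanced-learn releases return from SMOTE), while X_test is still a DataFrame with named columns. A minimal sketch of a workaround under that assumption follows. The helper name resample_keep_columns is hypothetical, not part of any library; it wraps the resampled output back into a DataFrame with the original column names, so training and prediction see the same feature names.

```python
import numpy as np
import pandas as pd


def resample_keep_columns(sampler, X, y):
    """Resample (X, y) and put the original column names back on X.

    Hypothetical helper: `sampler` is assumed to be an imbalanced-learn
    style object such as SMOTE().
    """
    # Newer imbalanced-learn releases expose fit_resample; older ones
    # (as used in the question) expose fit_sample.
    resample = getattr(sampler, "fit_resample", None) or sampler.fit_sample
    X_res, y_res = resample(X, y)

    # The sampler may return bare arrays; restore the DataFrame/Series
    # labels so the model is trained on named columns.
    X_res = pd.DataFrame(np.asarray(X_res), columns=X.columns)
    y_res = pd.Series(np.asarray(y_res), name=y.name)
    return X_res, y_res


# Intended usage with the names from the question (not executed here):
#   from imblearn.over_sampling import SMOTE
#   X_train, y_train = resample_keep_columns(SMOTE(), X_train, y_train)
#   model.fit(X_train, y_train)
#   y_pred = model.predict(X_test)
```

An alternative under the same assumption would be to leave the training arrays unnamed and call model.predict(X_test.values) instead, so that both sides are plain arrays; the column order must then match the training data exactly.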
