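To improve my linear regression model, I was advised to scale the features, specifically with RobustScaler, to get better performance. The shapes of my training and validation sets: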

Train set: (4304, 20) (4304,)
Validation set: (1435, 20) (1435,)

So I transformed X for both the training and validation sets:

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_train_robust_scaler = scaler.fit_transform(X_train.copy())
X_valid_robust_scaler = scaler.transform(X_valid.copy())

Then I fit the model and print the scores with the function print_score():

from sklearn import linear_model

regr_vol_2 = linear_model.LinearRegression()
regr_vol_2.fit(X_train_robust_scaler, y_train)

import numpy as np
import pandas as pd

def rmse(pred, actual):
    # Assumed helper (not shown in the original post): root mean squared error.
    return np.sqrt(np.mean((np.asarray(pred) - np.asarray(actual)) ** 2))

def print_score(m, X_train: pd.DataFrame, X_valid: pd.DataFrame, y_train: pd.Series, y_valid: pd.Series):
    '''Function takes a model and calculates and prints its RMSE values and r²
    scores for train and validation set. Also attaches oob_score for Random
    Forest model.
    Parameters:
    -----------
    (1) m --> given model;
    (2) X_train --> training set of independent features;
    (3) X_valid --> validation set of independent features;
    (4) y_train --> training set of the dependent variable;
    (5) y_valid --> validation set of the dependent variable;
    -----------
    Prints scoring values in the following order:
    [training rmse, validation rmse, r² for training set, r² for validation set,
    oob_score_]
    '''
    res = [rmse(m.predict(X_train), y_train),
           rmse(m.predict(X_valid), y_valid),
           m.score(X_train, y_train), m.score(X_valid, y_valid)]
    # Only some models (e.g. a random forest fitted with oob_score=True) expose this attribute.
    if hasattr(m, 'oob_score_'):
        res.append(m.oob_score_)
    print(res)


print_score(regr_vol_2, X_train_robust_scaler, X_valid_robust_scaler, y_train, y_valid)

Output, as [training rmse, validation rmse, r² for training set, r² for validation set]:
Before: [260.86301672800016, 271.8005003802866, 0.6184501389479591, 0.5976532655109332]
After: [260.8630167262612, 271.800437195055, 0.6184501389530468, 0.5976534525773189]

The two results are practically identical. What am I doing wrong? Should I also apply RobustScaler() to y_train and y_valid? If I do this:

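scaler_y = RobustScaler()
# RobustScaler expects a 2-D array, so reshape the target into a single column.
y_train_robust_scaler = scaler_y.fit_transform(y_train.to_numpy().reshape(-1, 1))
y_valid_robust_scaler = scaler_y.transform(y_valid.to_numpy().reshape(-1, 1))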

I get the same result as without it:

[training rmse, validation rmse, r² for training set, r² for validation set]
[260.8630167262612, 271.800437195055, 0.6184501389530468, 0.5976534525773189]

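Or should I apply RobustScaler() once to the whole dataset before splitting? If so, how should I handle that, given that I impute NaN values after the train/validation split?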

1 Answer

Scaling does not affect an unpenalized regression. It can improve convergence of the solver, but if the model is converging satisfactorily on the raw data, the results will be the same.
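
A minimal sketch of this point, under stated assumptions: the data below is synthetic (not the asker's dataset), and Ridge with alpha=10.0 is included only as an example of a penalized model for contrast. It fits each model on raw and on RobustScaler-transformed features and compares the validation predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import RobustScaler

# Synthetic data with very different feature scales (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([1.0, 10.0, 100.0, 0.1, 5.0])
y = X @ np.array([2.0, -0.3, 0.05, 8.0, 1.0]) + rng.normal(size=200)
X_train, X_valid = X[:150], X[150:]
y_train, y_valid = y[:150], y[150:]

# Fit the scaler on the training split only, then apply it to both splits.
scaler = RobustScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_valid_s = scaler.transform(X_valid)

def max_prediction_gap(model_cls, **kwargs):
    """Largest absolute difference between validation predictions
    of a model fitted on raw features and one fitted on scaled features."""
    raw = model_cls(**kwargs).fit(X_train, y_train).predict(X_valid)
    scaled = model_cls(**kwargs).fit(X_train_s, y_train).predict(X_valid_s)
    return float(np.max(np.abs(raw - scaled)))

ols_gap = max_prediction_gap(LinearRegression)     # unpenalized
ridge_gap = max_prediction_gap(Ridge, alpha=10.0)  # penalized, for contrast

print(f"OLS gap:   {ols_gap:.2e}")
print(f"Ridge gap: {ridge_gap:.2e}")
```

In this sketch the unpenalized gap should sit at floating-point noise level, because the coefficients and intercept simply rescale to absorb the transformation, while the penalized model's predictions shift noticeably because the penalty acts on coefficients whose size depends on the feature scale.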

answered 2021-01-10T17:47:04.813