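To improve my linear regression model, I was advised to use standardization, namely RobustScaler, to get better performance. The shapes of my training and validation sets are: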
Train set: (4304, 20) (4304,)
Validation set: (1435, 20) (1435,)
So I transformed X for both the training and validation sets:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_train_robust_scaler = scaler.fit_transform(X_train.copy())
X_valid_robust_scaler = scaler.transform(X_valid.copy())
Then I fit the model and printed the scores with the function print_score():
from sklearn import linear_model
regr_vol_2 = linear_model.LinearRegression()
regr_vol_2.fit(X_train_robust_scaler, y_train)
import numpy as np
import pandas as pd

def rmse(pred, actual):
    # Assumed helper (its definition is not shown in this post): root mean squared error.
    return np.sqrt(np.mean((np.asarray(pred) - np.asarray(actual)) ** 2))

def print_score(m, X_train: pd.DataFrame, X_valid: pd.DataFrame, y_train: pd.Series, y_valid: pd.Series):
    '''Function takes a model and calculates and prints its RMSE values and r²
    scores for train and validation set. Also attaches oob_score for Random
    Forest model.
    Parameters:
    -----------
    (1) m --> given model;
    (2) X_train --> training set of independent features;
    (3) X_valid --> validation set of independent features;
    (4) y_train --> training set of the dependent variable (target);
    (5) y_valid --> validation set of the dependent variable (target);
    -----------
    Returns scoring values in the following order:
    [training rmse, validation rmse, r² for training set, r² for validation set,
    oob_score_]
    '''
    res = [rmse(m.predict(X_train), y_train),
           rmse(m.predict(X_valid), y_valid),
           m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'):
        res.append(m.oob_score_)
    print(res)
    return res  # return the list itself; print() returns None
print_score(regr_vol_2,X_train_robust_scaler, X_valid_robust_scaler,y_train, y_valid)
Output | [training rmse, validation rmse, r² for training set, r² for validation set] |
---|---|
Before: | [260.86301672800016, 271.8005003802866, 0.6184501389479591, 0.5976532655109332] |
After: | [260.8630167262612, 271.800437195055, 0.6184501389530468, 0.5976534525773189] |
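The two results are practically identical (they differ only in the later decimal places). What am I doing wrong? Should I also use RobustScaler() for y_train and y_valid? If I do this: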
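import numpy as np
scaler_y = RobustScaler()
# RobustScaler expects a 2-D array, so reshape the 1-D target into a single column.
# np.asarray(...) avoids the [:, None] indexing that newer pandas versions reject on a Series.
y_train_robust_scaler = scaler_y.fit_transform(np.asarray(y_train).reshape(-1, 1))
y_valid_robust_scaler = scaler_y.transform(np.asarray(y_valid).reshape(-1, 1))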
I get the same result as without it:
| [training rmse, validation rmse, r² for training set, r² for validation set] |
|---|
| [260.8630167262612, 271.800437195055, 0.6184501389530468, 0.5976534525773189] |
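For reference, the same behavior can be reproduced without my data. The sketch below uses made-up synthetic data (the names `X_demo`, `y_demo` and the coefficients are hypothetical, not from my dataset) and fits plain LinearRegression with and without RobustScaler. Since the scaler applies an invertible affine map to each feature, an ordinary least-squares fit with an intercept should give the same predictions up to floating-point error, which seems consistent with the numbers above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import RobustScaler

# Synthetic data (an assumption for illustration): 5 features on very different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) * np.array([1.0, 10.0, 100.0, 0.1, 5.0])
y_demo = X_demo @ np.array([2.0, -0.5, 0.03, 4.0, 1.0]) + rng.normal(scale=3.0, size=200)

# Simple hold-out split: first 150 rows for training, last 50 for validation.
X_tr, X_va = X_demo[:150], X_demo[150:]
y_tr, y_va = y_demo[:150], y_demo[150:]

# Model on raw features.
raw_model = LinearRegression().fit(X_tr, y_tr)

# Model on robust-scaled features; the scaler is fitted on the training rows only.
scaler = RobustScaler().fit(X_tr)
scaled_model = LinearRegression().fit(scaler.transform(X_tr), y_tr)

pred_raw = raw_model.predict(X_va)
pred_scaled = scaled_model.predict(scaler.transform(X_va))

r2_raw = raw_model.score(X_va, y_va)
r2_scaled = scaled_model.score(scaler.transform(X_va), y_va)
print(r2_raw, r2_scaled)
```

The coefficients of the two models differ (they are expressed in different units), but the predictions and therefore RMSE and r² should match. Scaling would be expected to matter for regularized or distance-based models rather than for unregularized least squares.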
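Or should I apply RobustScaler() once to the whole dataset before splitting? If so, how should I do that, given that the NaN values are imputed after the train/validation split?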
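To make the last question concrete, here is a minimal sketch of one possible arrangement: keep the split first, and chain imputation, scaling and the regression in a scikit-learn Pipeline so that every step is fitted on the training rows only. The synthetic data, the 5% missing rate and the choice of median imputation are all assumptions for illustration, not what my actual preprocessing does.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(42)
X_all = rng.normal(size=(300, 4))
y_all = X_all @ np.array([1.5, -2.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=300)

# Knock out roughly 5% of the feature values to simulate missing data.
missing = rng.random(X_all.shape) < 0.05
X_all[missing] = np.nan

# Split first, so nothing is learned from the validation rows.
X_tr, X_va, y_tr, y_va = train_test_split(X_all, y_all, test_size=0.25, random_state=0)

# Imputer and scaler statistics are computed from X_tr only when fit() is called;
# score()/predict() then reuse those statistics on X_va.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RobustScaler(),
    LinearRegression(),
)
model.fit(X_tr, y_tr)

r2_valid = model.score(X_va, y_va)
print(r2_valid)
```

With this layout the order "split, then impute, then scale" is preserved, and the validation set only ever passes through transform(), never fit().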