I am trying to fit a multivariate linear regression problem in two different ways. The first is the straightforward one, shown below:
from sklearn import linear_model, metrics
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X = df[['geo','age','v_age']]
y = df['freq']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Fitting model
regr2 = linear_model.LinearRegression()
regr2.fit(X_train, y_train)
ypred = regr2.predict(X_test)  # this line was missing; ypred was undefined below

print(metrics.mean_squared_error(y_test, ypred))
print(r2_score(y_test, ypred))
The code above gives me an MSE of 0.46 and an R² score of 0.0012, which is a really poor fit. Meanwhile, when I use:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=1)  # degree=1 should give the same equation as the code block above
X_ = poly.fit_transform(X)
y = y.values.reshape(-1, 1)
predict_ = poly.fit_transform(y)  # note: the same transform is applied to the target as well
X_train, X_test, y_train, y_test = train_test_split(X_, predict_, test_size=0.33, random_state=42)

# Fitting model
regr2 = linear_model.LinearRegression()
regr2.fit(X_train, y_train)
ypred = regr2.predict(X_test)  # this line was missing; ypred was undefined below

print(metrics.mean_squared_error(y_test, ypred))
print(r2_score(y_test, ypred))
With PolynomialFeatures I get an MSE of 0.23 and an R² score of 0.5, which is much better. I don't understand how two approaches that should fit the same regression equation can give such different answers. Everything else is identical.
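For reference, here is a minimal sketch (with made-up input values, not the question's `df`) of what `PolynomialFeatures(degree=1)` actually returns: with the default `include_bias=True` it only prepends a constant column of ones, leaving the original features unchanged. The same happens to `y` when the transform is applied to it, so `predict_` has an extra constant column alongside the target.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Toy input with the same shape idea as the question's X (rows x 3 features)
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

poly = PolynomialFeatures(degree=1)
X_ = poly.fit_transform(X)

# degree=1 with include_bias=True just adds a leading column of ones
print(X_)
# [[1. 1. 2. 3.]
#  [1. 4. 5. 6.]]
```

So the two code blocks do not feed the model the same data: the second one trains on `[1, geo, age, v_age]` as inputs and `[1, freq]` as a two-column target, which is why the reported metrics differ.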