python - Training score decreases when the polynomial regression degree increases

Question

I'm trying to use linear regression to fit a polynomial to a set of points taken from a sinusoidal signal with some added noise, using linear_model.LinearRegression from sklearn.

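As expected, the training and validation scores increase as the degree of the polynomial increases. But after a degree of about 20, things get weird: the scores start going down, and the polynomials returned by the model look nothing like the data I used to train it.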

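Below are some plots where this can be seen, as well as the code that generated the regression models and the plots:

How things work well until degree=17. Original data VS predictions:

After that it just gets worse:

Validation curve, increasing the degree of the polynomial: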

from sklearn.pipeline import make_pipeline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import validation_curve

def make_data(N, err=0.1, rseed=1):
    rng = np.random.RandomState(rseed)
    x = 10 * rng.rand(N)
    X = x[:, None]
    y = np.sin(x) + 0.1 * rng.randn(N)
    if err > 0:
        y += err * rng.randn(N)
    return X, y

def PolynomialRegression(degree=4):
    return make_pipeline(PolynomialFeatures(degree),
                         LinearRegression())


X, y = make_data(400)

X_test = np.linspace(0, 10, 500)[:, None]
degrees = np.arange(0, 40)

plt.figure(figsize=(16, 8))
plt.scatter(X.flatten(), y)
for degree in degrees:
    y_test = PolynomialRegression(degree).fit(X, y).predict(X_test)
    plt.plot(X_test, y_test, label='degree={0}'.format(degree))
plt.title('Original data VS predicted values for different degrees')
plt.legend(loc='best');


degree = np.arange(0, 40)
train_score, val_score = validation_curve(PolynomialRegression(), X, y,
                                          param_name='polynomialfeatures__degree',
                                          param_range=degree, cv=7)

plt.figure(figsize=(12, 6))
plt.plot(degree, np.median(train_score, 1), marker='o', 
         color='blue', label='training score')
plt.plot(degree, np.median(val_score, 1), marker='o',
         color='red', label='validation score')
plt.legend(loc='best')
plt.ylim(0, 1)
plt.title('Validation curve, increasing the degree of the polynomial')
plt.xlabel('degree')
plt.ylabel('score');

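I know the expected behavior is that the validation score goes down as the complexity of the model increases, but why does the training score go down as well? What am I missing here?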

score 1 · Accepted Answer

First of all, here is how to fix it: set the normalize flag to True for the model:

def PolynomialRegression(degree=4):
    return make_pipeline(PolynomialFeatures(degree),
                         LinearRegression(normalize=True))

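But why? In LinearRegression, the fit() function finds the best-fitting model with the Moore–Penrose inverse, which is a common way to compute the least-squares solution. When you add polynomial terms of the values, the augmented features become very large very quickly if you do not normalize. These large values dominate the cost computed by least squares and lead the model to fit the larger values, i.e. the higher-order polynomial terms, instead of the data.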
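To make the scale problem concrete, here is a minimal sketch (not part of the original answer) that prints the largest feature value and the condition number of the polynomial design matrix, with and without scaling. It reuses the question's sampling of x from [0, 10). The helper name `design_matrix` is made up for this illustration, and `StandardScaler` is used here as a stand-in for the normalization step.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Same x sampling as make_data(400) in the question.
rng = np.random.RandomState(1)
X = (10 * rng.rand(400))[:, None]

def design_matrix(degree, scale=False):
    """Polynomial features of X (hypothetical helper for this illustration)."""
    # Drop the constant column: it has zero variance, so it cannot be standardized.
    feats = PolynomialFeatures(degree, include_bias=False).fit_transform(X)
    if scale:
        feats = StandardScaler().fit_transform(feats)
    return feats

for degree in (2, 5, 10, 20):
    raw = design_matrix(degree)
    scaled = design_matrix(degree, scale=True)
    print(degree,
          'max raw feature: %.1e' % np.abs(raw).max(),
          'cond raw: %.1e' % np.linalg.cond(raw),
          'cond scaled: %.1e' % np.linalg.cond(scaled))
```

The raw features grow roughly like 10**degree, so the design matrix quickly spans more orders of magnitude than double precision can resolve. Scaling removes that part of the ill-conditioning, although the columns remain strongly correlated.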

The plots now look much better, and they look the way they are supposed to.
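Note that the `normalize` argument was deprecated in scikit-learn 1.0 and removed in 1.2, so the fix shown in this answer only runs on older releases. A rough equivalent for newer versions, offered as a sketch and not as the answerer's exact method, is to standardize the polynomial features inside the pipeline. `ScaledPolynomialRegression` is a name invented here, the noise level is an assumption, and `StandardScaler` divides by the standard deviation where `normalize=True` divided by the column norm.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

def ScaledPolynomialRegression(degree=4):
    # Hypothetical variant of PolynomialRegression: standardize each
    # polynomial feature before the least-squares fit.
    return make_pipeline(PolynomialFeatures(degree, include_bias=False),
                         StandardScaler(),
                         LinearRegression())

# Noisy sine data, similar to the question's make_data (assumed noise level).
rng = np.random.RandomState(1)
x = 10 * rng.rand(400)
X = x[:, None]
y = np.sin(x) + 0.1 * rng.randn(400)

# Compare the training R^2 at a high degree, with and without scaling.
raw = make_pipeline(PolynomialFeatures(30), LinearRegression()).fit(X, y)
scaled = ScaledPolynomialRegression(30).fit(X, y)
raw_score = raw.score(X, y)
scaled_score = scaled.score(X, y)
print('degree 30, unscaled train R^2:', raw_score)
print('degree 30, scaled train R^2:  ', scaled_score)
```

The same `'polynomialfeatures__degree'` parameter name still works with `validation_curve`, since the first pipeline step is unchanged.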

score 0 · Accepted Answer

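The training score is expected to decrease as well, due to the model overfitting the training data. The validation error decreases because of the Taylor series expansion of the sine function. So, as you increase the degree of the polynomial, your model improves and fits the sine curve better.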

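In an ideal scenario, if you do not have a function that expands to infinite degree, you would see the training error go down (not monotonically, but in general) and the validation error go up after a certain degree (high for low degrees -> low for some higher degree -> increasing after that).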
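As a sanity check on the ideal scenario described above, here is a small sketch, not taken from the original answer, that fits the same kind of noisy sine data in a Chebyshev basis with NumPy's `Chebyshev.fit`. That basis maps x onto [-1, 1] and is far better conditioned than raw powers of x. The data generation and the 300/100 hold-out split are assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Chebyshev

rng = np.random.RandomState(1)
x = 10 * rng.rand(400)
y = np.sin(x) + 0.1 * rng.randn(400)

# Simple hold-out split (assumed sizes): 300 training and 100 validation points.
x_train, y_train = x[:300], y[:300]
x_val, y_val = x[300:], y[300:]

def mse(model, xs, ys):
    """Mean squared error of a fitted polynomial on (xs, ys)."""
    return float(np.mean((model(xs) - ys) ** 2))

train_mse, val_mse = [], []
for degree in range(0, 31):
    # domain=[0, 10] rescales the inputs onto [-1, 1] before fitting.
    model = Chebyshev.fit(x_train, y_train, degree, domain=[0, 10])
    train_mse.append(mse(model, x_train, y_train))
    val_mse.append(mse(model, x_val, y_val))

for degree in (1, 5, 10, 20, 30):
    print(degree, train_mse[degree], val_mse[degree])
```

With a well-conditioned basis, the training error should not increase as the degree grows, because every higher-degree model contains the lower-degree ones. A falling training score therefore points to a numerical problem in the fit.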
