
I'm working on a machine learning problem and want to use linear regression as the learning algorithm. I have implemented two different methods to find the parameters theta of the linear regression model: gradient (steepest) descent and the normal equation. On the same data they should both return approximately equal theta vectors. However, they do not.

Both theta vectors are very similar in all elements but the first one, i.e. the one that multiplies the column of all ones appended to the data (the intercept term).

Here is what the thetas look like (first column is the output of gradient descent, second the output of the normal equation):

Grad desc Norm eq
-237.7752 -4.6736
-5.8471   -5.8467
9.9174    9.9178
2.1135    2.1134
-1.5001   -1.5003
-37.8558  -37.8505
-1.1024   -1.1116
-19.2969  -19.2956
66.6423   66.6447
297.3666  296.7604
-741.9281 -744.1541
296.4649  296.3494
146.0304  144.4158
-2.9978   -2.9976
-0.8190   -0.8189

What can cause the difference in theta(1, 1) returned by gradient descent compared to theta(1, 1) returned by the normal equation? Do I have a bug in my code?

Here is my implementation of the normal equation in Matlab:

function theta = normalEque(X, y)
    [m, n] = size(X);
    X = [ones(m, 1), X];        % prepend a column of ones for the intercept
    theta = pinv(X'*X)*X'*y;    % normal equation, via pseudoinverse of X'*X
end

Here is the code for gradient descent:

function theta = gradientDesc(X, y)
    options = optimset('GradObj', 'on', 'MaxIter', 9999);   % supply analytic gradient
    [theta, ~, ~] = fminunc(@(t)(cost(t, X, y)),...
                    zeros(size(X, 2), 1), options);         % start from all zeros
end

function [J, grad] = cost(theta, X, y)
    m = size(X, 1);
    X = [ones(m, 1), X];                        % prepend a column of ones for the intercept
    J = sum((X * theta - y) .^ 2) ./ (2*m);     % squared-error cost
    for i = 1:size(theta, 1)
        grad(i, 1) = sum((X * theta - y) .* X(:, i)) ./ m;  % partial derivative w.r.t. theta(i)
    end
end

I pass exactly the same data X and y to both functions (I do not normalize X).

Edit 1:

Based on the answers and comments I reviewed my code and ran some tests.

First I wanted to check whether the problem could be caused by X being near singular, as suggested by @user1489497's answer. So I replaced pinv with inv, and when I ran it I indeed got the warning Matrix is close to singular or badly scaled. To make sure that this was not the problem, I obtained a much larger dataset and ran the tests on it. This time inv did not display the warning, and using pinv and inv gave the same results. So I hope that X is not close to singular any more.
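(For reference, a more direct way to check how close X is to singular, instead of waiting for the inv warning, is to look at the rank and condition number of the design matrix; a rough sketch, assuming X and y are already loaded:)

Xd = [ones(size(X, 1), 1), X];  % design matrix with intercept column
disp(rank(Xd))                  % should equal size(Xd, 2) for full column rank
disp(cond(Xd' * Xd))            % large values indicate near-singular normal equations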

Then I changed the normalEque code as suggested by woodchips, so it now looks like this:

function theta = normalEque(X, y)
    X = [ones(size(X, 1), 1), X];   % prepend a column of ones for the intercept
    theta = pinv(X)*y;              % minimum-norm least squares via pseudoinverse
end

However, the problem is still there. The new normalEque function, on new data that is not close to singular, gives a different theta than gradientDesc.

To find out which algorithm is buggy, I ran the linear regression algorithm of the data mining software Weka on the same data. Weka computed a theta very similar to the output of normalEque but different from the output of gradientDesc. So I guess that normalEque is correct and there is a bug in gradientDesc.

Here is a comparison of the thetas computed by Weka, normalEque and gradientDesc:

Weka(correct) normalEque    gradientDesc
779.8229      779.8163      302.7994
  1.6571        1.6571        1.7064
  1.8430        1.8431        2.3809
 -1.5945       -1.5945       -1.5964
  3.8190        3.8195        5.7486
 -4.8265       -4.8284      -11.1071
 -6.9000       -6.9006      -11.8924
-15.6956      -15.6958      -13.5411
 43.5561       43.5571       31.5036
-44.5380      -44.5386      -26.5137
  0.9935        0.9926        1.2153
 -3.1556       -3.1576       -1.8517
 -0.1927       -0.1919       -0.6583
  2.9207        2.9227        1.5632
  1.1713        1.1710        1.1622
  0.1091        0.1093        0.0084
  1.5768        1.5762        1.6318
 -1.3968       -1.3958       -2.1131
  0.6966        0.6963        0.5630
  0.1990        0.1990       -0.2521
  0.4624        0.4624        0.2921
-12.6013      -12.6014      -12.2014
 -0.1328       -0.1328       -0.1359

I also computed the errors as suggested by Justin Peel's answer. The output of normalEque gives a slightly smaller squared error, but the difference is marginal. What is more, when I compute the gradient of the cost at the theta from normalEque using the function cost (the same one used by gradientDesc), I get a gradient near zero. Doing the same on the output of gradientDesc does not give a gradient near zero. Here is what I mean:

>> [J_gd, grad_gd] = cost(theta_gd, X, y, size(X, 1));
>> [J_ne, grad_ne] = cost(theta_ne, X, y, size(X, 1));
>> disp([J_gd, J_ne])
  120.9932  119.1469
>> disp([grad_gd, grad_ne])
  -0.005172856743846  -0.000000000908598
  -0.026126463200876  -0.000000135414602
  -0.008365136595272  -0.000000140327001
  -0.094516503056041  -0.000000169627717
  -0.028805977931093  -0.000000045136985
  -0.004761477661464  -0.000000005065103
  -0.007389474786628  -0.000000005010731
   0.065544198835505  -0.000000046847073
   0.044205371015018  -0.000000046169012
   0.089237705611538  -0.000000046081288
  -0.042549228192766  -0.000000051458654
   0.016339232547159  -0.000000037654965
  -0.043200042729041  -0.000000051748545
   0.013669010209370  -0.000000037399261
  -0.036586854750176  -0.000000027931617
  -0.004761447097231  -0.000000027168798
   0.017311225027280  -0.000000039099380
   0.005650124339593  -0.000000037005759
   0.016225097484138  -0.000000039060168
  -0.009176443862037  -0.000000012831350
   0.055653840638386  -0.000000020855391
  -0.002834810081935  -0.000000006540702
   0.002794661393905  -0.000000032878097

This would suggest that gradient descent simply did not converge to the global minimum... But that can hardly be the case, as I ran it for thousands of iterations. So where is the bug?


4 Answers


I finally got time to come back to this. There is no "bug".

If the matrix is singular, then there are infinitely many solutions. You can choose any solution from that set and get an equally good answer. The pinv(X)*y solution is a good one that many people like because it is the minimum-norm solution.

There is never a good reason to use inv(X)*y. Even worse is to use the inverse on the normal equations, so inv(X'*X)*X'*y is simply numerical garbage. I don't care who told you to use it, they are leading you to the wrong place. (Yes, it will work acceptably for problems that are well-conditioned, but most of the time you don't know when it is about to get you into trouble. So why use it?)

The normal equations are in general a bad thing to do, even if you are solving a regularized problem. There are ways to do so that avoid squaring the condition number of the system, although I won't explain them unless asked, as this answer has gotten long enough.
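(To see what "squaring the condition number" means in practice, here is a minimal sketch with a deliberately ill-conditioned matrix A, made up just for illustration:)

A = randn(100, 5) * diag([1 1 1 1 1e-6]);  % last column is nearly negligible
disp(cond(A))        % condition number of A
disp(cond(A' * A))   % roughly cond(A)^2 -- the price of forming A'*A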

X\y will also yield a reasonable result.

There is absolutely no good reason to throw an unconstrained optimizer at the problem, as it will yield unstable results, totally dependent on your starting values.

As an example, I'll start with a singular problem.

X = repmat([1 2],5,1);
y = rand(5,1);

>> X\y
Warning: Rank deficient, rank = 1, tol =  2.220446e-15. 
ans =
                         0
         0.258777984694222

>> pinv(X)*y
ans =
         0.103511193877689
         0.207022387755377

pinv and backslash return slightly different solutions. It turns out that there is a basic solution, to which we can add any amount of a vector from the nullspace of X.

null(X)
ans =
         0.894427190999916
        -0.447213595499958

pinv generates the minimum-norm solution. Of all the solutions that could have been produced, this one has the minimum 2-norm.

Backslash, in contrast, generates a solution that will have one or more variables set to zero.

But if you use an unconstrained optimizer, it will generate a solution that is completely dependent on your starting values. Again, ANY amount of that null vector can be added to your solution, and you still have an entirely valid solution, with the same value of the sum of squares of the errors.
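(A quick numerical check of that claim, continuing the example above:)

theta0 = pinv(X)*y;            % minimum-norm solution
thetan = theta0 + 3*null(X);   % add an arbitrary multiple of the null vector
disp(norm(X*theta0 - y))       % residual of the minimum-norm solution
disp(norm(X*thetan - y))       % same residual, completely different theta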

Note that even though no singularity warning was returned, this does not mean your matrix is not close to singular. You have changed little about the problem, so it is still close, just not close enough to trigger the warning.

Answered 2012-06-30T15:49:05.700

As others have mentioned, an ill-conditioned Hessian matrix is likely the cause of your problem.

The number of steps that standard gradient descent algorithms take to reach a local optimum is a function of the largest eigenvalue of the Hessian divided by the smallest (this is known as the condition number of the Hessian). So, if your matrix is ill-conditioned, gradient descent can require an extremely large number of iterations to converge to the optimum. (For the singular case, it can converge to many points, of course.)

I would suggest trying three different things to verify that an unconstrained optimization algorithm works for your problem (which it should): 1) Generate some synthetic data by computing the result of a known linear function for random inputs and adding a small amount of Gaussian noise. Make sure you have many more data points than dimensions. This should produce a non-singular Hessian. 2) Add a regularization term to your error function to improve the conditioning of the Hessian. 3) Use a second-order method such as conjugate gradient or L-BFGS rather than gradient descent to reduce the number of steps needed for the algorithm to converge. (You will probably need to do this in conjunction with #2.) A rough sketch of #1 and #2 is given below.
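(A rough sketch of suggestions #1 and #2, with made-up sizes and variable names, assuming nothing beyond base Matlab:)

% 1) synthetic data: known linear function of random inputs plus Gaussian noise
m = 1000; n = 10;                           % many more points than dimensions
Xs = randn(m, n);
theta_true = randn(n + 1, 1);
ys = [ones(m, 1), Xs] * theta_true + 0.01 * randn(m, 1);

% 2) ridge-regularized solve; lambda > 0 improves the conditioning of the Hessian
lambda = 1e-3;
Xd = [ones(m, 1), Xs];
theta_ridge = (Xd' * Xd + lambda * eye(n + 1)) \ (Xd' * ys);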

Answered 2012-07-02T17:15:15.510

Could you post a little more about what your X looks like? You are using pinv(), which is the Moore-Penrose pseudoinverse. If the matrix is ill-conditioned, this could cause problems with obtaining the inverse. I would bet that the gradient descent method is closer to the mark.

Answered 2012-06-30T05:46:19.997

You should see which method actually gives you the smallest error. That will indicate which method is struggling. I suspect that the normal equation method is the one in trouble, because if X is ill-conditioned then you can run into problems there.

You should probably replace your normal equation solution with theta = X\y, which will use a QR-decomposition method to solve it.
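(A minimal sketch of both suggestions, assuming X does not yet contain the column of ones and that theta_gd and theta_ne are the fitted vectors from the two methods:)

Xd = [ones(size(X, 1), 1), X];        % design matrix with intercept
disp(sum((Xd * theta_gd - y) .^ 2))   % squared error of the gradient descent fit
disp(sum((Xd * theta_ne - y) .^ 2))   % squared error of the normal equation fit
theta_qr = Xd \ y;                    % backslash uses a QR-based least squares solve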

Answered 2012-06-30T05:36:14.357