I'm working on machine learning problem and want to use linear regression as learning algorithm. I have implemented 2 different methods to find parameters theta
of linear regression model: Gradient (steepest) descent and Normal equation. On the same data they should both give approximately equal theta
vector. However they do not.
Both theta
vectors are very similar on all elements but the first one. That is the one used to multiply vector of all 1 added to the data.
Here is how the theta
s look like (fist column is output of Gradient descent, second output of Normal equation):
Grad desc Norm eq
-237.7752 -4.6736
-5.8471 -5.8467
9.9174 9.9178
2.1135 2.1134
-1.5001 -1.5003
-37.8558 -37.8505
-1.1024 -1.1116
-19.2969 -19.2956
66.6423 66.6447
297.3666 296.7604
-741.9281 -744.1541
296.4649 296.3494
146.0304 144.4158
-2.9978 -2.9976
-0.8190 -0.8189
What can cause the difference in theta(1, 1)
returned by gradient descent compared to theta(1, 1)
returned by normal equation? Do I have bug in my code?
Here is my implementation of normal equation in Matlab:
function theta = normalEque(X, y)
[m, n] = size(X);
X = [ones(m, 1), X];
theta = pinv(X'*X)*X'*y;
end
Here is code for gradient descent:
function theta = gradientDesc(X, y)
options = optimset('GradObj', 'on', 'MaxIter', 9999);
[theta, ~, ~] = fminunc(@(t)(cost(t, X, y)),...
zeros(size(X, 2), 1), options);
end
function [J, grad] = cost(theta, X, y)
m = size(X, 1);
X = [ones(m, 1), X];
J = sum((X * theta - y) .^ 2) ./ (2*m);
for i = 1:size(theta, 1)
grad(i, 1) = sum((X * theta - y) .* X(:, i)) ./ m;
end
end
I pass exactly the same data X
and y
to both functions (I do not normalize X
).
Edit 1:
Based on answers and comments I checked few my code and run some tests.
First I want to check if the problem can be cause by X beeing near singular as suggested by @user1489497's answer. So I replaced pinv by inv - and when run it I really got warning Matrix is close to singular or badly scaled.
. To be sure that that is not the problem I obtained much larger dataset and run tests with this new dataset. This time inv(X)
did not display the warning and using pinv
and inv
gave same results. So I hope that X
is not close to singular any more.
Then I changed normalEque
code as suggested by woodchips so now it looks like:
function theta = normalEque(X, y)
X = [ones(size(X, 1), 1), X];
theta = pinv(X)*y;
end
However the problem is still there. New normalEque
function on new data that are not close to singular gives different theta
as gradientDesc
.
To find out which algorithm is buggy I have run linear regression algorithm of data mining software Weka on the same data. Weka computed theta very similar to output of normalEque
but different to the output of gradientDesc
. So I guess that normalEque
is correct and there is a bug in gradientDesc
.
Here is comparison of theta
s computed by Weka, normalEque
and GradientDesc
:
Weka(correct) normalEque gradientDesc
779.8229 779.8163 302.7994
1.6571 1.6571 1.7064
1.8430 1.8431 2.3809
-1.5945 -1.5945 -1.5964
3.8190 3.8195 5.7486
-4.8265 -4.8284 -11.1071
-6.9000 -6.9006 -11.8924
-15.6956 -15.6958 -13.5411
43.5561 43.5571 31.5036
-44.5380 -44.5386 -26.5137
0.9935 0.9926 1.2153
-3.1556 -3.1576 -1.8517
-0.1927 -0.1919 -0.6583
2.9207 2.9227 1.5632
1.1713 1.1710 1.1622
0.1091 0.1093 0.0084
1.5768 1.5762 1.6318
-1.3968 -1.3958 -2.1131
0.6966 0.6963 0.5630
0.1990 0.1990 -0.2521
0.4624 0.4624 0.2921
-12.6013 -12.6014 -12.2014
-0.1328 -0.1328 -0.1359
I also computed errors as suggested by Justin Peel's answer. Output of normalEque
gives slightly lesser squared error but the difference is marginal. What is more when I compute gradient of cost of theta
using function cost
(the same as the one used by gradientDesc
) I got gradient near zero. Same done on output of gradientDesc
does not give gradient near zero. Here is what I mean:
>> [J_gd, grad_gd] = cost(theta_gd, X, y, size(X, 1));
>> [J_ne, grad_ne] = cost(theta_ne, X, y, size(X, 1));
>> disp([J_gd, J_ne])
120.9932 119.1469
>> disp([grad_gd, grad_ne])
-0.005172856743846 -0.000000000908598
-0.026126463200876 -0.000000135414602
-0.008365136595272 -0.000000140327001
-0.094516503056041 -0.000000169627717
-0.028805977931093 -0.000000045136985
-0.004761477661464 -0.000000005065103
-0.007389474786628 -0.000000005010731
0.065544198835505 -0.000000046847073
0.044205371015018 -0.000000046169012
0.089237705611538 -0.000000046081288
-0.042549228192766 -0.000000051458654
0.016339232547159 -0.000000037654965
-0.043200042729041 -0.000000051748545
0.013669010209370 -0.000000037399261
-0.036586854750176 -0.000000027931617
-0.004761447097231 -0.000000027168798
0.017311225027280 -0.000000039099380
0.005650124339593 -0.000000037005759
0.016225097484138 -0.000000039060168
-0.009176443862037 -0.000000012831350
0.055653840638386 -0.000000020855391
-0.002834810081935 -0.000000006540702
0.002794661393905 -0.000000032878097
This would suggest that gradient descent simply did not converge to global minimum... But that is hardly the case as I run it for thousands of iterations. So where is the bug?