machine-learning - Gradient descent seems to fail

Question

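I implemented a gradient descent algorithm to minimize a cost function, in order to obtain a hypothesis for deciding whether an image is of good quality. I did it in Octave. The idea is loosely based on the algorithm from Andrew Ng's machine learning course.

So I have 880 values "y" ranging from 0.5 to ~12, and 880 values in "X" ranging from 50 to 300 that should predict the image's quality.

Sadly, the algorithm seems to fail: after some iterations my value for theta is so small that theta0 and theta1 become "NaN", and my linear regression curve has strange values...

Here is the code for the gradient descent algorithm (theta = zeros(2, 1); alpha = 0.01, iterations = 1500):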

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)

m = length(y); % number of training examples
J_history = zeros(num_iters, 1);

for iter = 1:num_iters

    % Sum of prediction errors (gradient term for theta0)
    tmp_j1 = 0;
    for i = 1:m
        tmp_j1 = tmp_j1 + ((theta(1,1) + theta(2,1)*X(i,2)) - y(i));
    end

    % Sum of errors weighted by the feature (gradient term for theta1)
    tmp_j2 = 0;
    for i = 1:m
        tmp_j2 = tmp_j2 + (((theta(1,1) + theta(2,1)*X(i,2)) - y(i)) * X(i,2));
    end

    % Simultaneous update: compute both new values before assigning
    tmp1 = theta(1,1) - (alpha * ((1/m) * tmp_j1));
    tmp2 = theta(2,1) - (alpha * ((1/m) * tmp_j2));

    theta(1,1) = tmp1;
    theta(2,1) = tmp2;

    % ============================================================

    % Save the cost J in every iteration
    J_history(iter) = computeCost(X, y, theta);
end
end

Here is the computation of the cost function:

function J = computeCost(X, y, theta)

m = length(y); % number of training examples
J = 0;
tmp = 0;
for i = 1:m
    tmp = tmp + (theta(1,1) + theta(2,1)*X(i,2) - y(i))^2; % squared prediction error
end
J = (1/(2*m)) * tmp;
end

score 44 · Accepted Answer

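If you are wondering how the seemingly complex for loop can be vectorized and condensed into a single-line expression, please read on. The vectorized form is: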

theta = theta - (alpha/m) * (X' * (X * theta - y))

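A detailed explanation of how we arrive at this vectorized expression from the gradient descent algorithm is given below.

This is the gradient descent algorithm to fine-tune the value of θ:

θ_j := θ_j - (α/m) * Σ_{i=1..m} (h(x^(i)) - y^(i)) * x_j^(i)    (updated simultaneously for all j)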

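Assume that the following values of X, y and θ are given:

m = number of training examples
n = number of features + 1

Here

m = 5 (training examples)
n = 4 (features + 1)
X = m x n matrix
y = m x 1 vector
θ = n x 1 vector
x^(i) is the i-th training example
x_j is the j-th feature in a given training example

Further,

h(x) = ([X] * [θ]) (m x 1 matrix of predicted values for our training set)
h(x) - y = ([X] * [θ] - [y]) (m x 1 matrix of errors in our predictions)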

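The whole objective of machine learning is to minimize the errors in predictions. Based on the above, our errors matrix E is an m x 1 column vector:

E = X * θ - y = [e_1; e_2; ...; e_m], where e_i = h(x^(i)) - y^(i)

To calculate the new value of θ_j, we have to sum all the errors (m rows) multiplied by the j-th feature value of the training set X. That is, take every value in E, multiply each one by the j-th feature of the corresponding training example, and add them all together. This gives us the new (and hopefully better) value of θ_j. Repeat this process for all j, that is, for every feature. In matrix form, this can be written as:

θ = θ - (α/m) * [Σ_i e_i * x_1^(i); Σ_i e_i * x_2^(i); ...; Σ_i e_i * x_n^(i)]

This can be simplified as:

θ = θ - (α/m) * ([E]' x [X])'

[E]' x [X] gives us a row vector, since E' is a 1 x m matrix and X is an m x n matrix. But we are interested in a column vector, so we transpose the result.

More succinctly, it can be written as:

θ = θ - (α/m) * (E' * X)'

Because (A * B)' = (B' * A') and A'' = A, we can also write the above as:

θ = θ - (α/m) * (X' * E)

This is the original expression we started out with: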

theta = theta - (alpha/m) * (X' * (X * theta - y))
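
As a quick sanity check of the derivation, the following Octave sketch compares the per-feature sums with the two matrix forms on a small random problem. The sizes m = 5 and n = 4 follow the example above; the random data, the alpha value and the variable names (grad_loop, theta_a, theta_b) are made up purely for illustration.

```octave
% Illustrative data only: m = 5 examples, n = 4 columns (bias + 3 features)
m = 5; n = 4;
X = [ones(m, 1), rand(m, n - 1)];
y = rand(m, 1);
theta = rand(n, 1);
alpha = 0.1;               % arbitrary value, only used for the comparison

E = X * theta - y;         % m x 1 vector of prediction errors

% Element-wise form: for each j, sum the errors times the j-th feature
grad_loop = zeros(n, 1);
for j = 1:n
    for i = 1:m
        grad_loop(j) = grad_loop(j) + E(i) * X(i, j);
    end
end
theta_loop = theta - (alpha / m) * grad_loop;

% The two matrix forms from the derivation
theta_a = theta - (alpha / m) * (E' * X)';
theta_b = theta - (alpha / m) * (X' * (X * theta - y));

% Both differences should be at rounding-error level
disp(max(abs(theta_loop - theta_a)));
disp(max(abs(theta_a - theta_b)));
```

If the three results agree up to rounding, the loop and the one-line expression are interchangeable, which is all the derivation claims.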

score 31 · Accepted Answer

I vectorized the theta update... this may help somebody:

theta = theta - (alpha/m *  (X * theta-y)' * X)';

score 25 · Accepted Answer

I think that your computeCost function is wrong. I took Ng's class last year and I have the following (vectorized) implementation:

m = length(y);
J = 0;
predictions = X * theta;
sqrErrors = (predictions-y).^2;

J = 1/(2*m) * sum(sqrErrors);

The rest of the implementation seems fine to me, although you could vectorize it as well:

theta_1 = theta(1) - alpha * (1/m) * sum((X*theta-y).*X(:,1));
theta_2 = theta(2) - alpha * (1/m) * sum((X*theta-y).*X(:,2));

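Afterwards you correctly assign the temporary thetas (called theta_1 and theta_2 here) back to the "real" theta.

In general, vectorizing is more useful than looping, and it is less annoying to read and to debug.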

score 2 · Accepted Answer

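If you are OK with using a least-squares cost function, you could try the normal equation instead of gradient descent. It is much simpler (only one line) and computationally faster.

Here is the normal equation: http://mathworld.wolfram.com/NormalEquation.html

And in Octave form: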

theta = (pinv(X' * X )) * X' * y

Here is a tutorial that explains how to use the normal equation: http://www.lauradhamilton.com/tutorial-linear-regression-with-octave
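
To see the one-liner in action, here is a small sketch on synthetic data. The data-generating line (intercept 2, slope 0.03, x between 50 and 300) is invented for illustration and only loosely mimics the ranges in the question; with noise-free data the normal equation should recover those two numbers up to rounding.

```octave
% Synthetic, noise-free data (made up for illustration)
m = 880;
x = linspace(50, 300, m)';
X = [ones(m, 1), x];        % first column of ones for the intercept
y = 2 + 0.03 * x;

% Normal equation, as in the answer above
theta = pinv(X' * X) * X' * y;

disp(theta');               % should be close to [2 0.03]
```

Note that no learning rate or iteration count is involved, so the unscaled feature range does not cause the divergence seen with gradient descent.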

score 2 · Accepted Answer

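While not as scalable as a vectorized version, a loop-based gradient descent computation should produce the same results. In the example above, the most probable reason for gradient descent failing to compute the correct theta is the value of alpha.

With a verified set of cost and gradient descent functions and a data set similar to the one described in the question, theta ends up with NaN values after just a few iterations when alpha = 0.01. However, when alpha is set to 0.000001, gradient descent works as expected, even after 100 iterations.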
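
The effect is easy to reproduce. The sketch below runs the vectorized update on invented data in the same range as the question (880 points, x between 50 and 300) with both learning rates. The helper name run_gd and the data-generating line are assumptions for illustration, not the asker's data or code.

```octave
1;  % marks this file as a script so the function can be defined inline

function [theta, J_history] = run_gd(X, y, alpha, num_iters)
    % Plain batch gradient descent using the vectorized update
    m = length(y);
    theta = zeros(size(X, 2), 1);
    J_history = zeros(num_iters, 1);
    for iter = 1:num_iters
        theta = theta - (alpha / m) * (X' * (X * theta - y));
        J_history(iter) = sum((X * theta - y) .^ 2) / (2 * m);
    end
end

% Invented data: unscaled feature in [50, 300], targets between 2 and 12
m = 880;
x = linspace(50, 300, m)';
X = [ones(m, 1), x];
y = 0.04 * x;

[theta_big, J_big]     = run_gd(X, y, 0.01, 1500);
[theta_small, J_small] = run_gd(X, y, 0.000001, 100);

disp(all(isfinite(theta_big)));     % false when the large alpha has diverged
disp(all(diff(J_small) <= 0));      % true when the cost never increases
```

With x values in the hundreds, the x.^2 terms in X' * X are on the order of tens of thousands, so a step size of 0.01 overshoots on every iteration, while 0.000001 keeps each step small enough for the cost to decrease.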

score 0 · Accepted Answer

Using only vectors, here is a compact implementation of linear regression with gradient descent in Mathematica:

Theta = {0, 0}
alpha = 0.0001;
iteration = 1500;
Jhist = Table[0, {i, iteration}];
Table[  
  Theta = Theta - 
  alpha * Dot[Transpose[X], (Dot[X, Theta] - Y)]/m; 
  Jhist[[k]] = 
  Total[ (Dot[X, Theta] - Y[[All]])^2]/(2*m); Theta, {k, iteration}]

Note: this of course assumes that X is an n x 2 matrix whose first column, X[[All, 1]], contains only 1s.

score 0 · Accepted Answer

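This should work:

% Compute the errors once, before either component changes, so both updates use the same theta
err = X*theta - y;
theta(1,1) = theta(1,1) - (alpha*(1/m))*(err' * X(:,1));
theta(2,1) = theta(2,1) - (alpha*(1/m))*(err' * X(:,2));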

score 0 · Accepted Answer

This is cleaner, and vectorized as well:

predictions = X * theta;
errorsVector = predictions - y;
theta = theta - (alpha/m) * (X' * errorsVector);

score 0 · Accepted Answer

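If you remember the first PDF file on gradient descent from the machine learning course, you will recall the warning about the learning rate. Here is the note from that PDF.

Implementation Note: If your learning rate is too large, J(theta) can diverge and "blow up", resulting in values which are too large for computer calculations. In these situations, Octave/MATLAB will tend to return NaNs. NaN stands for "not a number" and is often caused by undefined operations that involve -∞ and +∞.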
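
The last sentence of that note can be checked directly at the Octave prompt. This is just a minimal illustration of how an overflowed value turns into NaN; the variable name big is arbitrary.

```octave
big = realmax * 10;        % exceeds the largest double, so it overflows to Inf
disp(big);                 % Inf
disp(big - big);           % Inf - Inf is undefined, so the result is NaN
disp(isnan(big - big));    % 1
```

In a diverging gradient descent run the same thing happens inside the update: once theta and the gradient have both overflowed, subtracting one from the other yields NaN.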
