machine-learning - What is the difference between these two ways of updating the hypothesis during stochastic gradient descent?

Question

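I have a question about updating theta during stochastic GD. I have two ways to update theta:

1) Use the previous theta to compute the hypotheses for all samples, then update theta once per sample. For example: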

hypothese = np.dot(X, theta)
for i in range(0, m):
    theta = theta + alpha * (y[i] - hypothese[i]) * X[i]

2) The other way: while scanning through the samples, recompute hypothese[i] with the latest theta. For example:

for i in range(0, m):
    h = np.dot(X[i], theta)
    theta = theta + alpha * (y[i] - h) * X[i]

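I checked some SGD code, and the second way seems to be the correct one. But in my own experiments, the first way converges faster and gives better results than the second. Why does the wrong way perform better than the correct way?

The complete code is attached below: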

import numpy as np

# X (m samples x n features) and y (m targets) are assumed to be defined globally.

def SGD_method1():
    maxIter = 100  # max iterations
    alpha = 1e4  # learning rate
    m, n = np.shape(X)  # X[m,n], m: #samples, n: #features
    theta = np.zeros(n)  # initial theta
    for iter in range(0, maxIter):
        hypothese = np.dot(X, theta)  # compute all hypotheses using the same theta
        for i in range(0, m):
            theta = theta + alpha * (y[i] - hypothese[i]) * X[i]
    return theta

def SGD_method2():
    maxIter = 100  # max iterations
    alpha = 1e4  # learning rate
    m, n = np.shape(X)  # X[m,n], m: #samples, n: #features
    theta = np.zeros(n)  # initial theta
    for iter in range(0, maxIter):
        for i in range(0, m):
            h = np.dot(X[i], theta)  # compute one hypothesis using the latest theta
            theta = theta + alpha * (y[i] - h) * X[i]
    return theta

score 0 · Accepted Answer

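The first code is not SGD. It is "traditional" (batch) gradient descent: because hypothese is computed once from the same theta, the inner loop simply adds up the per-sample terms, which amounts to one update along the full gradient. The stochasticity in SGD comes from updating based on a gradient computed for a single sample (or for a small batch, which is called mini-batch SGD). That is obviously not the true gradient of the error function (which is the sum of the errors over all training samples), but it can be shown that, under reasonable conditions, such a process converges to a local optimum. Stochastic updates are preferable in many applications because of their simplicity and (in many cases) cheaper computation. Both algorithms are correct (both are guaranteed to converge to a local optimum under reasonable assumptions), and the choice of strategy depends on the particular problem (especially its size and other requirements).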
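
To make the point about the first method concrete, here is a minimal sketch that checks it numerically. The synthetic data, the random seed, the step size alpha = 1e-2, and the helper names method1_pass, batch_step, and method2_pass are all assumptions made for illustration; they are not part of the question. The sketch compares one outer iteration of the first method with a single vectorized full-batch step, and with one pass of per-sample updates that use the latest theta.

```python
import numpy as np

# Synthetic data for illustration only (not taken from the question).
rng = np.random.default_rng(0)
m, n = 50, 3
X = rng.normal(size=(m, n))
y = X @ np.array([1.0, -2.0, 0.5])

alpha = 1e-2                      # assumed step size, small enough to be stable here
theta0 = rng.normal(size=n)       # arbitrary starting point


def method1_pass(theta):
    """One outer iteration of the question's first method."""
    hypothese = X @ theta         # computed once, from the old theta
    for i in range(m):
        theta = theta + alpha * (y[i] - hypothese[i]) * X[i]
    return theta


def batch_step(theta):
    """One full-batch gradient step on 0.5 * sum((y - X @ theta) ** 2)."""
    return theta + alpha * X.T @ (y - X @ theta)


def method2_pass(theta):
    """One pass of per-sample updates, each using the latest theta."""
    for i in range(m):
        h = X[i] @ theta
        theta = theta + alpha * (y[i] - h) * X[i]
    return theta


# Method 1 matches the batch step; method 2 generally does not.
same_as_batch = np.allclose(method1_pass(theta0), batch_step(theta0))
differs_from_sgd = not np.allclose(method1_pass(theta0), method2_pass(theta0))
print(same_as_batch, differs_from_sgd)
```

Note that the second method as written visits the samples in a fixed order; SGD implementations commonly shuffle the sample order on each pass, which this sketch leaves out to stay close to the question's code.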
