python - 随机梯度下降收敛准则

Question

目前我的 SGD 收敛标准检查 MSE 错误率是否在特定边界内。

def compute_mse(data, labels, weights):
    m = len(labels)
    hypothesis = np.dot(data,weights)
    sq_errors = (hypothesis - labels) ** 2
    mse = np.sum(sq_errors)/(2.0*m)
    return mse

cur_mse = 1.0
prev_mse = 100.0
m = len(labels)
while cur_mse/prev_mse < 0.99999:
    prev_mse = cur_mse

    for i in range(m):
        d = np.array(data[i])
        hypothesis = np.dot(d, weights)
        gradient = np.dot((labels[i] - hypothesis), d)/m
        weights = weights + (alpha * gradient)

    cur_mse = compute_mse(data, labels, weights)
    if cur_mse > prev_mse:
        return

权重会更新到训练集中的单个数据点。

alpha 为 0.001，模型应该在几次迭代中收敛，但我没有收敛。这个收敛标准是不是太严格了？

score 2 · Accepted Answer

我会试着回答这个问题。首先，随机梯度下降的伪代码如下所示：

input: f(x), alpha, initial x (guess or random)
output: min_x f(x) # x that minimizes f(x)

while True:
    shuffle data # good practice, not completely needed
    for d in data:
        x -= alpha * grad(f(x)) # df/dx
    if <stopping criterion>:
        break

可以将其他regularization参数添加到要最小化的函数中，例如l1 penalty避免过度拟合。

回到你的问题，看看你的数据和梯度的定义，看起来你想解决一个简单的线性方程组的形式：

Ax = b

这产生了客观函数：

f(x) = ||Ax - b||^2

随机梯度下降一次使用一行数据：

||A_i x - b||

哪里|| o ||是欧几里得范数，_i表示一行的索引。

在这里，A是你的data，x是你的weights，b是你的labels。

然后将函数的梯度计算为：

grad(f(x)) = 2 * A.T (Ax - b)

或者在随机梯度下降的情况下：

2 * A_i.T (A_i x - b)

where.T表示转置。

将所有内容放回您的代码中...首先我将设置一个合成数据：

A = np.random.randn(100, 2) # 100x2 data
x = np.random.randn(2, 1) # 2x1 weights
b = np.random.randint(0, 2, 100).reshape(100, 1) # 100x1 labels
b[b == 0] = -1 # labels in {-1, 1}

然后，定义参数：

alpha = 0.001
cur_mse = 100.
prev_mse = np.inf
it = 0
max_iter = 100
m = A.shape[0]
idx = range(m)

并循环！

while cur_mse/prev_mse < 0.99999 and it < max_iter:
    prev_mse = cur_mse
    shuffle(idx)

    for i in idx:
        d = A[i:i+1]
        y = b[i:i+1]
        h = np.dot(d, x)
        dx = 2 * np.dot(d.T, (h - y))
        x -= (alpha * dx)

    cur_mse = np.mean((A.dot(x) - b)**2)
    if cur_mse > prev_mse:
        raise Exception("Not converging")
    it += 1

此代码与您的代码几乎相同，但添加了一些内容：

基于迭代次数的另一个停止标准（如果系统不收敛或收敛速度太慢，则避免永远循环）
重新定义渐变dx（仍然与您的相似）。您将符号倒置，因此权重更新为正+，因为在我的示例中为负-（这是有道理的，因为您正在梯度下降）。
data和的索引labels。虽然data[i]给出了一个大小的元组(2,)（在这种情况下为100x2数据），但使用花哨的索引data[i:i+1]将返回数据的视图而不对其进行整形（例如使用 shape (1, 2)），因此将允许您执行正确的矩阵乘法。

您可以根据可接受的mse错误添加第三个停止标准，即：if cur_mse < 1e-3: break.

这个算法，随机数据，对我来说在 20-40 次迭代中收敛（取决于生成的随机数据）。

所以...假设这是您要最小化的功能，如果此方法对您不起作用，则可能意味着您的系统未确定（您的训练数据少于特征，这意味着A宽多于高） .

希望能帮助到你！

python - 随机梯度下降收敛准则

1 回答 1

Related

Reference