
I followed this tutorial and tried to modify it a bit to see whether I had understood it correctly. However, when I use torch.optim.SGD as follows:

import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F

device = torch.device("cuda:0")
dtype = torch.float

# dimensions as in the tutorial: batch size, input, hidden, output
N, D_in, H, D_out = 64, 1000, 100, 10

x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)
w1 = torch.nn.Parameter(torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True))
w2 = torch.nn.Parameter(torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True))

lr = 1e-6
optimizer = torch.optim.SGD([w1, w2], lr=lr)

for t in range(500):
    # forward pass: two-layer net with ReLU
    layer_1 = x.matmul(w1)
    layer_1 = F.relu(layer_1)
    y_pred = layer_1.matmul(w2)

    # squared-error loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # zero gradients, backprop, and let the optimizer update the weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

the loss jumps to Inf by the third iteration and then to nan, which is completely different from updating the weights manually. The manual-update code is below (it is also in the tutorial link):

x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # forward pass and squared-error loss
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())
    loss.backward()

    # manually apply the SGD update, then reset the gradients
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        w1.grad.zero_()
        w2.grad.zero_()

I would like to know what is wrong with my modified version (the first snippet). When I replace SGD with Adam, the results are very good: the loss decreases after every iteration, with no Inf or nan.
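
For reference, this is the kind of swap I mean (a minimal sketch: only the optimizer line changes relative to the first snippet, and the learning rate 1e-4 shown here is the value the tutorial uses with Adam, not necessarily the one I ran):

# Adam variant (sketch): same setup and loop as the first snippet, different optimizer.
# lr=1e-4 is an assumption taken from the tutorial's Adam example.
optimizer = torch.optim.Adam([w1, w2], lr=1e-4)
for t in range(500):
    y_pred = F.relu(x.matmul(w1)).matmul(w2)
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()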
