Broadly there are two ways:

  1. Call loss.backward() on every batch, but only call optimizer.step() and optimizer.zero_grad() every N batches. Is it the case that the gradients of the N batches are summed up, so that to maintain the same learning rate per effective batch we have to divide the learning rate by N (or, equivalently, scale each loss by 1/N)? See the sketch after this list.

  2. Accumulate the loss instead of the gradients, and call (loss / N).backward() once every N batches. This is easy to understand, but doesn't it defeat the purpose of saving memory, since the computation graphs of all N batches have to be kept alive until that single backward pass? The learning rate doesn't need adjusting to maintain the same learning rate per effective batch, but it should be multiplied by N if you want to maintain the same learning rate per example.
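
A minimal sketch of approach (1), using a toy model and random data purely for illustration (model, optimizer, loader, and N are placeholders, not from the question). Because backward() adds into .grad, the gradients of the N micro-batches are indeed summed; scaling each per-batch loss by 1/N turns that sum into an average, so the learning rate per effective batch needs no adjustment:

    import torch
    from torch import nn

    # Toy stand-ins for a real model, optimizer, and dataloader.
    model = nn.Linear(10, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loader = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(16)]

    N = 4  # number of micro-batches per effective batch

    optimizer.zero_grad()
    for i, (x, y) in enumerate(loader):
        loss = nn.functional.cross_entropy(model(x), y)
        # backward() accumulates into .grad, so the N micro-batch gradients are summed.
        # Dividing each loss by N makes the accumulated gradient an average, which keeps
        # the effective learning rate the same as for one big batch of size 8 * N.
        (loss / N).backward()
        if (i + 1) % N == 0:
            optimizer.step()
            optimizer.zero_grad()

Each micro-batch's graph is freed as soon as its backward() returns, which is exactly the memory saving that approach (2) gives up by keeping all N graphs alive until a single backward call.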

Which one is better, or more commonly used in packages such as pytorch-lightning? It seems that optimizer.zero_grad() is a perfect fit for gradient accumulation, so (1) should be the recommended approach.

1 Answer

You can use PyTorch Lightning and get this feature out of the box; see the accumulate_grad_batches Trainer argument, which you can also pair with gradient_clip_val. There is more in the docs.
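
A minimal sketch of that setup; only accumulate_grad_batches and gradient_clip_val come from the answer above, while the toy LightningModule, the synthetic data, and the other Trainer arguments are assumptions for illustration:

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset
    import pytorch_lightning as pl

    class ToyModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.net = nn.Linear(10, 2)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return nn.functional.cross_entropy(self.net(x), y)

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)

    dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
    loader = DataLoader(dataset, batch_size=8)

    # accumulate_grad_batches sums gradients over 4 micro-batches before each optimizer
    # step; gradient_clip_val clips the accumulated gradients before the step is applied.
    trainer = pl.Trainer(
        max_epochs=1,
        accumulate_grad_batches=4,
        gradient_clip_val=0.5,
    )
    trainer.fit(ToyModel(), loader)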

answered 2022-01-11T22:08:58.163