
I am getting the following error (see the stack trace) when I run my code on a different GPU (Tesla K-20, CUDA 7.5 installed, 6 GB memory). The code works fine if I run it on a GeForce 1080 or Titan X GPU.

Stacktrace:

File "code/source/main.py", line 68, in <module>
    train.train_epochs(train_batches, dev_batches, args.epochs)
  File "/gpfs/home/g/e/geniiexe/BigRed2/code/source/train.py", line 34, in train_epochs
    losses = self.train(train_batches, dev_batches, (epoch + 1))
  File "/gpfs/home/g/e/geniiexe/BigRed2/code/source/train.py", line 76, in train
    self.optimizer.step()
  File "/gpfs/home/g/e/geniiexe/BigRed2/anaconda3/lib/python3.5/site-packages/torch/optim/adam.py", line 70, in step
    bias_correction1 = 1 - beta1 ** state['step']
OverflowError: (34, 'Numerical result out of range')

So, what could be the reason for getting such an error on a different GPU (Tesla K-20) while it works fine on a GeForce or Titan X GPU? Moreover, what does the error mean? Is it related to a memory overflow? I don't think it is.


2 Answers


In case anyone ends up here like I did, looking for the same error but on the CPU with scikit-learn's MLPClassifier, the fix above happened to be a good enough hint for patching the sklearn code as well.

The fix is in the file .../site-packages/sklearn/neural_network/_stochastic_optimizers.py

Change this:

self.learning_rate = (self.learning_rate_init *
                      np.sqrt(1 - self.beta_2 ** self.t) /
                      (1 - self.beta_1 ** self.t))

to this:

orig_self_t = self.t
new_self_t = min(orig_self_t, 1022)
self.learning_rate = (self.learning_rate_init *
                          np.sqrt(1 - self.beta_2 ** new_self_t) /
                          (1 - self.beta_1 ** new_self_t))
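
To sanity-check what the clamp does without editing the library, the same expression can be evaluated standalone. This is just a sketch: the step values below are made up, and learning_rate_init, beta_1, and beta_2 are set to the documented MLPClassifier defaults.

import numpy as np

learning_rate_init = 0.001   # MLPClassifier's default learning_rate_init
beta_1, beta_2 = 0.9, 0.999  # MLPClassifier's default beta_1 / beta_2

for t in (10, 1_000, 1_000_000):
    new_self_t = min(t, 1022)  # the clamp from the patch above
    lr = (learning_rate_init *
          np.sqrt(1 - beta_2 ** new_self_t) /
          (1 - beta_1 ** new_self_t))
    print(f"t={t}: effective learning rate {lr:.6g}")

With the exponent capped at 1022, beta_1 ** new_self_t and beta_2 ** new_self_t stay far above the smallest normal double, so the power can no longer hit the out-of-range path that raised the OverflowError.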

Answered 2018-09-13T04:35:43.193

One workaround suggested on discuss.pytorch.org is the following.

Replace the following lines in adam.py:

bias_correction1 = 1 - beta1 ** state['step']
bias_correction2 = 1 - beta2 ** state['step']

with:

bias_correction1 = 1 - beta1 ** min(state['step'], 1022)
bias_correction2 = 1 - beta2 ** min(state['step'], 1022)
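
A quick standalone check of what the clamped terms evaluate to (the step values here are hypothetical; beta1 = 0.9 and beta2 = 0.999 are PyTorch's default Adam betas):

beta1, beta2 = 0.9, 0.999  # PyTorch's default Adam betas

for step in (1, 100, 10_000, 1_000_000):
    bias_correction1 = 1 - beta1 ** min(step, 1022)
    bias_correction2 = 1 - beta2 ** min(step, 1022)
    print(step, bias_correction1, bias_correction2)

Capping the exponent at 1022 keeps beta ** step well above the double-precision underflow threshold; presumably it is that underflow that makes the platform's pow() report ERANGE (errno 34) and surface as the OverflowError shown in the question.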

Answered 2017-06-02T02:37:11.027