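While training this code with ray tune (1 GPU per trial), after a few hours of training (about 20 trials) GPU:0,1 started raising a CUDA out of memory error. Even after I terminated the training process, those GPUs still give the out of memory error.
As shown above, all of my GPU devices are currently empty, and no Python processes are running other than these two.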
import torch
torch.rand(1, 2).to('cuda:0') # cuda out of memory error
torch.rand(1, 2).to('cuda:1') # cuda out of memory error
torch.rand(1, 2).to('cuda:2') # working
torch.rand(1, 2).to('cuda:3') # working
torch.cuda.device_count() # 4
torch.cuda.memory_reserved() # 0
torch.cuda.is_available() # True
# error message of GPU 0, 1
RuntimeError: CUDA error: out of memory
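The manual per-device check above can be wrapped in a small loop so that each GPU is probed the same way. This is only a minimal sketch: `probe_devices`, `torch_allocate` and `report_with_torch` are hypothetical helper names, and it assumes the failure surfaces as a `RuntimeError`, as in the message above.

```python
def probe_devices(device_count, try_allocate):
    """Return {index: status}; status is "ok" or the error text for that device."""
    results = {}
    for index in range(device_count):
        try:
            try_allocate(index)
            results[index] = "ok"
        except RuntimeError as exc:
            # PyTorch reports the failure above as a RuntimeError.
            results[index] = f"error: {exc}"
    return results


def torch_allocate(index):
    """Assumed probe: move a tiny tensor onto cuda:<index>, as in the snippet above."""
    import torch  # imported lazily so probe_devices itself does not need torch

    torch.rand(1, 2).to(f"cuda:{index}")


def report_with_torch():
    """Print one status line per visible GPU (requires torch and a CUDA driver)."""
    import torch

    statuses = probe_devices(torch.cuda.device_count(), torch_allocate)
    for index, status in statuses.items():
        print(f"cuda:{index}: {status}")
```

Calling `report_with_torch()` in a fresh interpreter should list which devices accept an allocation and which ones fail, without stopping at the first error.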
Even so, GPU:0,1 keep giving the out of memory error. If I reboot the computer (Ubuntu 18.04.3), it returns to normal, but if I train the code again, the same problem occurs.
How can I debug this problem, or resolve it without rebooting?
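One thing that may be worth checking is whether a leftover process still holds the GPU device files open even though nothing shows up as running. The following is a minimal sketch, assuming Linux's `/proc` layout and that the NVIDIA device nodes live under `/dev/nvidia*`; `find_device_holders` is a hypothetical helper, not part of any library, and it only sees processes whose file descriptors the current user is allowed to read.

```python
import os


def find_device_holders(proc_root="/proc", prefix="/dev/nvidia"):
    """Map PID -> sorted list of open files whose path starts with `prefix`."""
    holders = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or its fd table is not readable by us
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed in the meantime
            if target.startswith(prefix):
                holders.setdefault(int(entry), set()).add(target)
    return {pid: sorted(paths) for pid, paths in holders.items()}
```

Running `find_device_holders()` as root gives the most complete view. An empty result would suggest that no user-space process is holding the devices, which is not by itself proof that the driver state is clean.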
- Ubuntu 18.04.3
- RTX 2080ti
- CUDA 10.2
- NVIDIA driver version: 460.27.04
- cudnn 7.6.4.38
- Python 3.8.4
- PyTorch 1.7.0, 1.9.0, 1.9.0+cu111
- CPU: AMD Ryzen Threadripper 2950X 16-Core Processor x 32
- Memory: 125G
- Power supply: 2000W
- dmesg output (there is no "GPU has fallen off the bus" error):
dmesg | grep -i -e nvidia -e nvrm
[ 5.946174] nvidia: loading out-of-tree module taints kernel.
[ 5.946181] nvidia: module license 'NVIDIA' taints kernel.
[ 5.956595] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 5.968280] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[ 5.970485] nvidia 0000:09:00.0: enabling device (0000 -> 0003)
[ 5.970571] nvidia 0000:09:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 6.015145] nvidia 0000:0a:00.0: enabling device (0000 -> 0003)
[ 6.015394] nvidia 0000:0a:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 6.064993] nvidia 0000:42:00.0: enabling device (0000 -> 0003)
[ 6.065072] nvidia 0000:42:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 6.115778] nvidia 0000:43:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 6.164680] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 460.27.04 Fri Dec 11 23:35:05 UTC 2020
[ 6.174137] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 460.27.04 Fri Dec 11 23:24:19 UTC 2020
[ 6.176472] [drm] [nvidia-drm] [GPU ID 0x00000900] Loading driver
[ 6.176567] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:09:00.0 on minor 0
[ 6.176635] [drm] [nvidia-drm] [GPU ID 0x00000a00] Loading driver
[ 6.176636] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:0a:00.0 on minor 1
[ 6.176709] [drm] [nvidia-drm] [GPU ID 0x00004200] Loading driver
[ 6.176710] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:42:00.0 on minor 2
[ 6.176760] [drm] [nvidia-drm] [GPU ID 0x00004300] Loading driver
[ 6.176761] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:43:00.0 on minor 3
[ 6.189768] nvidia-uvm: Loaded the UVM driver, major device number 511.
[ 6.744582] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:40/0000:40:03.1/0000:43:00.1/sound/card4/input12
[ 6.744664] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:40/0000:40:03.1/0000:43:00.1/sound/card4/input15
[ 6.744755] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:40/0000:40:03.1/0000:43:00.1/sound/card4/input17
[ 6.744852] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:40/0000:40:03.1/0000:43:00.1/sound/card4/input19
[ 6.744952] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:40/0000:40:01.3/0000:42:00.1/sound/card3/input11
[ 6.745301] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:40/0000:40:01.3/0000:42:00.1/sound/card3/input16
[ 6.745739] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:40/0000:40:01.3/0000:42:00.1/sound/card3/input18
[ 6.746280] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:40/0000:40:01.3/0000:42:00.1/sound/card3/input20
[ 7.117377] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.3/0000:09:00.1/sound/card0/input9
[ 7.117453] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.3/0000:09:00.1/sound/card0/input10
[ 7.117505] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.3/0000:09:00.1/sound/card0/input13
[ 7.117559] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.3/0000:09:00.1/sound/card0/input14
[ 7.117591] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.1/0000:0a:00.1/sound/card1/input21
[ 7.117650] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.1/0000:0a:00.1/sound/card1/input22
[ 7.117683] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.1/0000:0a:00.1/sound/card1/input23
[ 7.117720] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.1/0000:0a:00.1/sound/card1/input24
[ 9.462521] caller os_map_kernel_space.part.8+0x74/0x90 [nvidia] mapping multiple BARs
- numba and tensorflow show the same problem, so it does not seem to be caused by PyTorch.
>>> from numba import cuda
>>> device = cuda.get_current_device()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/api.py", line 460, in get_current_device
return current_context().device
File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 212, in get_context
return _runtime.get_or_create_context(devnum)
File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 138, in get_or_create_context
return self._get_or_create_context_uncached(devnum)
File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 153, in _get_or_create_context_uncached
return self._activate_context_for(0)
File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 169, in _activate_context_for
newctx = gpu.get_primary_context()
File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 542, in get_primary_context
driver.cuDevicePrimaryCtxRetain(byref(hctx), self.id)
File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 302, in safe_cuda_api_call
self._check_error(fname, retcode)
File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 342, in _check_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_OUT_OF_MEMORY
Update
After rebooting and upgrading PyTorch to 1.9.1+cu111, the problem no longer seems to occur.