tensorflow - Google Colaboratory `ResourceExhaustedError` with GPU

Question

I'm trying to fine-tune a Vgg16 model using Colaboratory, but I ran into this error when training with the GPU.

OOM when allocating tensor of shape [7,7,512,4096]

INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.ResourceExhaustedError'>, OOM when allocating tensor of shape [7,7,512,4096] and type float
     [[Node: vgg_16/fc6/weights/Momentum/Initializer/zeros = Const[_class=["loc:@vgg_16/fc6/weights"], dtype=DT_FLOAT, value=Tensor<type: float shape: [7,7,512,4096] values: [[[0 0 0]]]...>, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Caused by op 'vgg_16/fc6/weights/Momentum/Initializer/zeros', defined at:

My VM session also has this output:

--- colab vm info ---
python v=3.6.3
tensorflow v=1.4.1
tf device=/device:GPU:0
model name  : Intel(R) Xeon(R) CPU @ 2.20GHz
model name  : Intel(R) Xeon(R) CPU @ 2.20GHz
MemTotal:       13341960 kB
MemFree:         1541740 kB
MemAvailable:   10035212 kB

My tfrecord contains just 118 256x256 JPGs, with a file size <2MB.

Is there a workaround? It works when I use the CPU, just not the GPU.

score 4 · Accepted Answer

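Seeing a small amount of free GPU memory almost always indicates that you've created a TensorFlow session without the allow_growth = True option. See: https://www.tensorflow.org/guide/using_gpu#allowing_gpu_memory_growth

If you don't set this option, then by default TensorFlow will reserve nearly all GPU memory when a session is created.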
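To get a feel for how much headroom the failing allocation needs, the size of the tensor named in the error message can be estimated from its shape. This is only a rough sketch: it assumes the "type float" in the message means 4-byte float32, and that the Momentum optimizer keeps one accumulator of the same shape as the variable (which is what the `fc6/weights/Momentum` node name suggests). The helper name `tensor_bytes` is made up for illustration.

```python
from functools import reduce
from operator import mul

BYTES_PER_FLOAT32 = 4  # assumption: "type float" in the error means float32


def tensor_bytes(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Return the memory needed for a dense tensor with the given shape."""
    return reduce(mul, shape, 1) * bytes_per_element


fc6_shape = [7, 7, 512, 4096]  # shape taken from the OOM message
weights = tensor_bytes(fc6_shape)
# One Momentum accumulator of the same shape sits next to the weights.
weights_plus_momentum = 2 * weights

print("fc6 weights: {:.0f} MiB".format(weights / 2**20))
print("weights + momentum slot: {:.0f} MiB".format(weights_plus_momentum / 2**20))
```

So this single layer needs several hundred MiB on its own, and if another session has already reserved nearly all of the GPU memory, an allocation of that size is a likely place for the OOM to show up.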

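Good news: as of this week, Colab now sets this option by default, so you should see much lower memory growth as you use multiple notebooks on Colab. You can also inspect GPU memory usage per notebook by selecting "Manage sessions" from the Runtime menu.

Once selected, you'll see a dialog that lists all notebooks and the GPU memory each one is consuming. To free memory, you can also terminate runtimes from this dialog.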

score 1 · Accepted Answer

I ran into the same issue, and I found that my problem was caused by the code below:

from tensorflow.python.framework.test_util import is_gpu_available as tf
if tf()==True:
  device='/gpu:0'
else:
  device='/cpu:0'

I checked the GPU memory usage with the code below. Before running the code above the utilization was 0%; after running it, it went up to 95%.

# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize    
import psutil
import humanize
import os
import GPUtil as GPU
GPUs = GPU.getGPUs()
# XXX: only one GPU on Colab and isn't guaranteed
gpu = GPUs[0]

def printm():
  process = psutil.Process(os.getpid())
  print("Gen RAM Free: " + humanize.naturalsize( psutil.virtual_memory().available ), " I Proc size: " + humanize.naturalsize( process.memory_info().rss))
  print('GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB'.format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))

printm()

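Before:

Gen RAM Free: 12.7 GB  I Proc size: 139.1 MB

GPU RAM Free: 11438MB | Used: 1MB | Util   0% | Total 11439MB

After:

Gen RAM Free: 12.0 GB  I Proc size: 1.0 GB

GPU RAM Free: 564MB | Used: 10875MB | Util  95% | Total 11439MB

Somehow, is_gpu_available() managed to consume most of the GPU memory without releasing it afterwards, so instead I used the code below to detect the GPU status, and the problem was solved.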

!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
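try:
  import GPUtil as GPU
  # getGPUs() may return an empty list instead of raising when no GPU is found
  device = '/gpu:0' if GPU.getGPUs() else '/cpu:0'
except Exception:
  # GPUtil is missing or nvidia-smi could not be run
  device = '/cpu:0'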
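If you'd rather not depend on GPUtil, the same check can probably be done with the standard library by asking nvidia-smi directly, which also avoids creating any TensorFlow session. This is a minimal sketch, assuming nvidia-smi is on the PATH (on Colab, the `ln -sf` line above takes care of that); the function names `parse_memory_line` and `query_gpu_memory` are mine, not part of any library.

```python
import shutil
import subprocess


def parse_memory_line(line):
    """Parse one 'used, total' CSV line from nvidia-smi into ints (MiB)."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def query_gpu_memory():
    """Return [(used_mib, total_mib), ...], or [] if no GPU can be queried."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        out = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            universal_newlines=True,
        )
        return [parse_memory_line(line) for line in out.splitlines() if line.strip()]
    except (subprocess.CalledProcessError, OSError, ValueError):
        # nvidia-smi failed, or printed something that is not a number
        return []


gpus = query_gpu_memory()
# Pick the device without allocating anything on the GPU.
device = '/gpu:0' if gpus else '/cpu:0'
print(device, gpus)
```

Besides choosing the device, the returned (used, total) pairs can be printed before and after a suspicious call to see whether it grabbed GPU memory.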

score 0 · Accepted Answer

I failed to reproduce the originally reported error, but if it is caused by running out of GPU memory (as opposed to main memory), this might help:

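# See https://www.tensorflow.org/tutorials/using_gpu#allowing_gpu_memory_growth
import tensorflow as tf
config = tf.ConfigProto()
# Allocate GPU memory on demand instead of reserving nearly all of it up front
config.gpu_options.allow_growth = True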

Then pass session_config=config to e.g. slim.learning.train() (or whatever session ctor you end up using).

score 0 · Accepted Answer

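In my case, the solution provided by Ami did not solve the problem, even though it is a very good one, probably because the Colaboratory VM can't provide more resources.

I got the OOM error in the detection phase (not during model training). I worked around it by disabling the GPU for detection: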

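import tensorflow as tf
# device_count={'GPU': 0} hides all GPUs from this session, so ops run on the CPU
config = tf.ConfigProto(device_count = {'GPU': 0})
sess = tf.Session(config=config)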
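A related approach, which is not what I used above, is to hide the GPU from the whole process with the CUDA_VISIBLE_DEVICES environment variable. A minimal sketch, assuming the variable is set before TensorFlow initializes CUDA (safest is before importing it); `hide_gpus` is just an illustrative helper name:

```python
import os


def hide_gpus():
    """Hide all CUDA devices from libraries initialized after this call."""
    # "-1" is not a valid device id, so CUDA ends up seeing no GPUs at all.
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"


hide_gpus()
# import tensorflow as tf  # import only after the variable has been set
```

Unlike the ConfigProto setting, this applies to every session in the process, so it only fits when the whole notebook should run on the CPU.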
