tensorflow - 除非 GPU 成功加载，否则 TensorFlow 如何不运行脚本？

Question

我一直在尝试在某些带有 GPU 的机器上运行一些 TensorFlow 训练，但是，每当我尝试这样做时，我都会收到某种类型的错误，似乎表明它由于某种原因无法使用 GPU（通常是内存问题，或者cuda 问题或 cudnn 等）。但是，由于 TensorFlow 自动执行的操作是在不能使用 GPU 的情况下仅在 CPU 中运行，因此我很难判断它是否真的能够利用 GPU。因此，除非正在使用 GPU，否则我想让我的脚本失败/停止。我怎么做？

举个例子，目前我有消息：

I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Tesla P100-SXM2-16GB
major: 6 minor: 0 memoryClockRate (GHz) 1.4805
pciBusID 0000:85:00.0
Total memory: 15.93GiB
Free memory: 15.63GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:85:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Tesla P100-SXM2-16GB
major: 6 minor: 0 memoryClockRate (GHz) 1.4805
pciBusID 0000:85:00.0
Total memory: 15.93GiB
Free memory: 522.25MiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:85:00.0)
E tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
F tensorflow/core/kernels/conv_ops.cc:532] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)

它似乎加载了所有的cuda罚款，但最后抱怨。抱怨的线路是：

E tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
F tensorflow/core/kernels/conv_ops.cc:532] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)

我们可以尝试调试这些特定的错误，但是在它继续训练的那一刻，我不知道它是使用 cpu 还是 gpu。如果出现任何奇怪的 cuda/cudnn 或任何 gpu 错误，我们可以让它不继续训练吗？

score 1 · Accepted Answer

使用with tf.device('/gpu:0'):. 如果/gpu:0不存在，这将杀死您的程序。

例如见https://github.com/hughperkins/tensorflow-cl/blob/tensorflow-cl/tensorflow/stream_executor/cl/test/test_binary_ops.py#L52

with tf.Graph().as_default():
    with tf.Session(config=tf.ConfigProto(log_device_placement=False)) as sess:
        with tf.device('/gpu:0'):
            tf_a = tf.placeholder(tf_dtype, [None, None], 'a')
            tf_b = tf.placeholder(tf_dtype, [None, None], 'b')
            tf_c = tf.__dict__[tf_func](tf_a, tf_b, name="c")

score 0 · Accepted Answer

您可以列出 tensorflow 中所有可用的设备：如何在 tensorflow 中获取当前可用的 GPU？. 如果 GPU 不在列表中，您可以让程序抛出异常。

tensorflow - 除非 GPU 成功加载，否则 TensorFlow 如何不运行脚本？

2 回答 2

Related

Reference