
I've been trying to run some TensorFlow training on machines that have GPUs, but whenever I try, I get some kind of error that seems to indicate it can't use the GPU for some reason (usually a memory issue, or a CUDA or cuDNN issue, etc.). However, since TensorFlow's automatic behavior is to fall back to running on the CPU alone when it can't use the GPU, it's hard for me to tell whether it was actually able to take advantage of the GPU. So I'd like my script to fail/stop unless the GPU is being used. How do I do that?


As an example, currently I get these messages:

I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Tesla P100-SXM2-16GB
major: 6 minor: 0 memoryClockRate (GHz) 1.4805
pciBusID 0000:85:00.0
Total memory: 15.93GiB
Free memory: 15.63GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:85:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Tesla P100-SXM2-16GB
major: 6 minor: 0 memoryClockRate (GHz) 1.4805
pciBusID 0000:85:00.0
Total memory: 15.93GiB
Free memory: 522.25MiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:85:00.0)
E tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
F tensorflow/core/kernels/conv_ops.cc:532] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)

It seems to load all the CUDA libraries fine, but complains at the end. The offending lines are:

E tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
F tensorflow/core/kernels/conv_ops.cc:532] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)

We could try to debug these particular errors, but the moment training continues anyway, I have no idea whether it is using the CPU or the GPU. Can we make it refuse to continue training if any strange CUDA/cuDNN or other GPU error comes up?


2 Answers


Use with tf.device('/gpu:0'):. This will kill your program if /gpu:0 doesn't exist: because soft placement is off by default (allow_soft_placement=False in tf.ConfigProto), pinning an op to a device that isn't available raises an error instead of silently falling back to the CPU.

For an example, see https://github.com/hughperkins/tensorflow-cl/blob/tensorflow-cl/tensorflow/stream_executor/cl/test/test_binary_ops.py#L52

with tf.Graph().as_default():
    with tf.Session(config=tf.ConfigProto(log_device_placement=False)) as sess:
        with tf.device('/gpu:0'):  # raises if /gpu:0 is unavailable
            # tf_dtype and tf_func are defined earlier in the linked test file
            tf_a = tf.placeholder(tf_dtype, [None, None], 'a')
            tf_b = tf.placeholder(tf_dtype, [None, None], 'b')
            tf_c = tf.__dict__[tf_func](tf_a, tf_b, name="c")
Answered 2017-03-25T12:21:32.150

You can list all the devices available to TensorFlow; see: How to get current available GPUs in tensorflow?. If the GPU is not in the list, you can make your program throw an exception.
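A minimal sketch of that idea. The device-name check itself is plain Python; with TensorFlow 1.x installed, the names would come from `device_lib.list_local_devices()` as shown in the comment. `require_gpu` is a hypothetical helper name, not a TensorFlow API:

```python
def require_gpu(device_names):
    """Raise RuntimeError unless at least one device name mentions a GPU."""
    if not any('GPU' in name.upper() for name in device_names):
        raise RuntimeError('No GPU device found; refusing to continue.')
    return True

# With TensorFlow installed, the names would come from:
#   from tensorflow.python.client import device_lib
#   names = [d.name for d in device_lib.list_local_devices()]
#   require_gpu(names)

if __name__ == '__main__':
    require_gpu(['/cpu:0', '/gpu:0'])  # passes
    require_gpu(['/cpu:0'])            # raises RuntimeError
```

Calling this at the top of a training script makes it fail fast instead of silently training on the CPU.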

Answered 2017-02-22T23:15:18.417