tensorflow - TensorFlow Object Detection API runs only on the CPU, with the message "Ignoring device specification /device:GPU:0 for node 'prefetch_queue_Dequeue'"

Question

When I start training, the process runs only on the CPU. This is the message I get at startup:

Ignoring device specification /device:GPU:0 for node 'prefetch_queue_Dequeue' because the input edge from 'prefetch_queue' is a reference connection and already has a device field set to /device:CPU:0

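While the GPU memory fills up, the GPU load shows only occasional sudden spikes and stays at 0% most of the time. Needless to say, performance is very slow.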
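One way to quantify the "mostly 0% with occasional spikes" pattern, rather than reading it off a monitoring window, is to record nvidia-smi samples once per second during training and summarize them afterwards. The sketch below assumes samples recorded with nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader,nounits -l 1 on a single GPU; the function name summarize_gpu_samples and the 5% idle cutoff are hypothetical choices for illustration.

```python
import csv
import io
import statistics


def summarize_gpu_samples(csv_text, idle_threshold=5.0):
    """Summarize GPU samples, one line per sample: "<util %>, <memory used MiB>".

    Assumed to come from:
      nvidia-smi --query-gpu=utilization.gpu,memory.used \
                 --format=csv,noheader,nounits -l 1
    idle_threshold is a hypothetical cutoff below which a sample counts as idle.
    """
    utils, mem = [], []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        utils.append(float(row[0]))  # float() tolerates the leading space
        mem.append(float(row[1]))
    idle = sum(1 for u in utils if u <= idle_threshold)
    return {
        "samples": len(utils),
        "mean_util": statistics.mean(utils),
        "max_util": max(utils),
        "idle_fraction": idle / len(utils),
        "max_mem_mib": max(mem),
    }
```

A high idle_fraction together with high max_mem_mib would be consistent with the symptom described here: the model is resident in GPU memory, but the GPU spends most of its time waiting for work.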

It seems that everything is loaded onto the CPU instead of the GPU, because CPU utilization is close to 100%.

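I also noticed the same problem on Ubuntu, but there it is at least 4 times faster than on Windows (each step takes 400 ms, versus 1300 ms on Windows).
I installed TensorFlow on both Ubuntu (14.04) and Windows, in both cases with the command pip install --upgrade tensorflow-gpu.
Here is the full training log: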

G:\Tensorflow_section\models-master\object_detection>python train.py  --logtostderr --train_dir=training_stuff --pipeline_config_path=ssd_mobilenet_v1_pets.config
INFO:tensorflow:Summary name Learning Rate is illegal; using Learning_Rate instead.
WARNING:tensorflow:From C:\Users\Master\Anaconda3\envs\anaconda35\lib\site-packages\object_detection-0.1-py3.5.egg\object_detection\meta_architectures\ssd_meta_arch.py:607: all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Please use tf.global_variables instead.
INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.
2017-09-18 03:44:08.545358: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-09-18 03:44:08.545474: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-18 03:44:09.121357: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:955] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.835
pciBusID 0000:01:00.0
Total memory: 8.00GiB
Free memory: 6.63GiB
2017-09-18 03:44:09.121483: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:976] DMA: 0
2017-09-18 03:44:09.122196: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:986] 0:   Y
2017-09-18 03:44:09.133158: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
INFO:tensorflow:Restoring parameters from training_stuff\model.ckpt-0
2017-09-18 03:44:15.528390: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\simple_placer.cc:697] Ignoring device specification /device:GPU:0 for node 'prefetch_queue_Dequeue' because the input edge from 'prefetch_queue' is a reference connection and already has a device field set to /device:CPU:0
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path training_stuff\model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 1: loss = 20.1465 (18.034 sec/step)
INFO:tensorflow:global step 2: loss = 15.8647 (1.601 sec/step)
INFO:tensorflow:global step 3: loss = 13.3987 (1.540 sec/step)
INFO:tensorflow:global step 4: loss = 11.5424 (1.562 sec/step)
INFO:tensorflow:global step 5: loss = 10.8328 (1.337 sec/step)
INFO:tensorflow:global step 6: loss = 10.7179 (1.317 sec/step)
INFO:tensorflow:global step 7: loss = 9.7616 (1.369 sec/step)
INFO:tensorflow:global step 8: loss = 8.5631 (1.336 sec/step)
INFO:tensorflow:global step 9: loss = 7.2683 (1.384 sec/step)
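To compare runs (for example the Windows and Ubuntu machines) by more than a glance at the first few lines, the sec/step values can be pulled out of the log and averaged. A minimal sketch, assuming the log format shown above; steady_state_step_time is a hypothetical helper, and skipping only step 1 as warm-up is an arbitrary choice.

```python
import re
import statistics

# Matches lines such as:
# INFO:tensorflow:global step 2: loss = 15.8647 (1.601 sec/step)
STEP_RE = re.compile(r"global step (\d+): loss = ([\d.]+) \(([\d.]+) sec/step\)")


def steady_state_step_time(log_text, warmup_steps=1):
    """Return the mean sec/step, ignoring the first warmup_steps steps.

    Step 1 (18 s in the log above) presumably includes one-off startup
    work, so including it would distort the average.
    """
    times = []
    for match in STEP_RE.finditer(log_text):
        step = int(match.group(1))
        if step > warmup_steps:
            times.append(float(match.group(3)))
    return statistics.mean(times)
```

Applied to the nine step lines above, it averages steps 2 to 9 to about 1.43 s per step.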

What is going wrong, and how can I fix it?
By the way, here is some additional information:

OS platform and distribution: Windows 10 x64 1703, Build 15063.540
TensorFlow installed from (source or binary): binary (installed with pip install)
TensorFlow version: 1.3.0
Python version: 3.5.3
CUDA/cuDNN version: CUDA 8.0 / cuDNN v6.0
GPU model and memory: GeForce GTX 1080, 8 GB

Update:
Here is the output of the snippet suggested in the comments:

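import tensorflow as tf  # TensorFlow 1.x API (tf.InteractiveSession, tf.FIFOQueue)

sess = tf.InteractiveSession()
# Create the queue and the enqueue op on the CPU.
with tf.device('/cpu:0'):
    q = tf.FIFOQueue(3, 'float')
    add_op = q.enqueue(42)
# Request the GPU for the dequeue op; the placer overrides this request
# because the dequeue reads the queue through a reference connection.
with tf.device('/gpu:0'):
    get_op = q.dequeue()

sess.run(add_op)
print(sess.run(get_op))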

Output:

2017-09-17 23:30:59.539728: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-09-17 23:30:59.539857: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-17 23:30:59.856904: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:955] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.835
pciBusID 0000:01:00.0
Total memory: 8.00GiB
Free memory: 6.63GiB
2017-09-17 23:30:59.857034: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:976] DMA: 0
2017-09-17 23:30:59.858320: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:986] 0:   Y
2017-09-17 23:30:59.858688: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
2017-09-17 23:30:59.879245: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\simple_placer.cc:697] Ignoring device specification /device:GPU:0 for node 'fifo_queue_Dequeue' because the input edge from 'fifo_queue' is a reference connection and already has a device field set to /job:localhost/replica:0/task:0/device:CPU:0
42.0
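Both logs describe the same rule, as the message itself states: when an op's input edge is a reference connection (here, the queue that the dequeue op reads from), the op is placed on the producer's device and any conflicting device request on it is ignored. The following is a deliberately simplified model of that rule in plain Python, not TensorFlow's actual placer; the function place and its inputs are hypothetical, with node names mirroring the snippet above.

```python
def place(nodes, ref_edges):
    """Toy placement model (hypothetical, NOT TensorFlow's placer).

    nodes: dict mapping node name -> requested device string.
    ref_edges: (producer, consumer) pairs that are reference connections;
        the consumer is forced onto the producer's device.
    Returns (placement dict, list of log-style messages).
    """
    placement = dict(nodes)
    messages = []
    for producer, consumer in ref_edges:
        forced = placement[producer]
        if placement[consumer] != forced:
            messages.append(
                "Ignoring device specification %s for node '%s' because the "
                "input edge from '%s' is a reference connection and already "
                "has a device field set to %s"
                % (placement[consumer], consumer, producer, forced)
            )
            placement[consumer] = forced  # override the requested device
    return placement, messages
```

In this model only the dequeue node is pulled onto the CPU; a node that consumes the dequeued value through an ordinary (non-reference) edge keeps its requested GPU device.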

score 2 · Accepted Answer

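OK, I also ran some tests on Ubuntu, and this is what I found:
On Ubuntu the computation runs on the GPU, and it is 4 to 6 times faster than on Windows (TensorFlow 1.3 installed with pip).
It uses the GPU, and the CPU load appears to be spread fairly evenly across the cores. This is not the case with the Windows version.
On Windows it does seem to use the GPU, since it occupies 7.1 GB of VRAM, but the CPU load is extreme: all cores are maxed out while the GPU load is negligible, which cripples the performance of the whole system.
Further investigation led me to a similar question, "Object detection with the GPU on Windows is about 5 times slower than on Ubuntu". The person there says this is caused by some dependency issues on Windows, and that TensorFlow has not done anything about it since version 1.2!
So the bottom line is: if you plan to use the Object Detection API and want to use your GPU efficiently, use Ubuntu and run your experiments there. TensorFlow does not seem to care about Windows at all!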
