Using Keras with a TensorFlow backend, I am trying to train an LSTM network, and it takes much longer to run on the GPU than on the CPU.
I am training the LSTM with the fit_generator function. Each epoch takes ~250 seconds on the CPU, while the GPU needs ~900 seconds per epoch. The packages in my GPU environment include:
keras-applications 1.0.8 py_0 anaconda
keras-base 2.2.4 py36_0 anaconda
keras-gpu 2.2.4 0 anaconda
keras-preprocessing 1.1.0 py_1 anaconda
...
tensorflow 1.13.1 gpu_py36h3991807_0 anaconda
tensorflow-base 1.13.1 gpu_py36h8d69cac_0 anaconda
tensorflow-estimator 1.13.0 py_0 anaconda
tensorflow-gpu 1.13.1 pypi_0 pypi
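For reference, the same version information can also be checked from inside Python; this is only a minimal sketch using the standard TF 1.x / Keras version attributes:
import tensorflow as tf
import keras

print(tf.__version__)                # 1.13.1 in this environment
print(keras.__version__)             # 2.2.4 in this environment
print(tf.test.is_built_with_cuda())  # should be True for the GPU build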
My CUDA compilation tools are version 9.1.85, and my CUDA and driver versions are:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 2080 On | 00000000:0A:00.0 Off | N/A |
| 0% 39C P8 5W / 225W | 7740MiB / 7952MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 2080 On | 00000000:42:00.0 Off | N/A |
| 0% 33C P8 19W / 225W | 142MiB / 7951MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 49251 C .../whsu014/.conda/envs/whsuphd/bin/python 7729MiB |
| 1 1354 G /usr/lib/xorg/Xorg 16MiB |
| 1 49251 C .../whsu014/.conda/envs/whsuphd/bin/python 113MiB |
+-----------------------------------------------------------------------------+
When I insert this line of code
with tf.Session(config=tf.ConfigProto(log_device_placement=True)):
I see the following in the terminal:
...
ining_1/Adam/Const_10: (Const)/job:localhost/replica:0/task:0/device:GPU:0
training_1/Adam/Const_11: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-06-25 11:27:31.720653: I tensorflow/core/common_runtime/placer.cc:1059] training_1/Adam/Const_11: (Const)/job:localhost/replica:0/task:0/device:GPU:0
training_1/Adam/add_15/y: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-06-25 11:27:31.720666: I tensorflow/core/common_runtime/placer.cc:1059] training_1/Adam/add_15/y: (Const)/job:localhost/replica:0/task:0/device:GPU:0
...
So TensorFlow does appear to be using the GPU.
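As an additional sanity check on device visibility (a minimal sketch using the TF 1.x test/client utilities, separate from the log_device_placement output above):
import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.test.is_gpu_available())                          # True when a CUDA device is usable
print([d.name for d in device_lib.list_local_devices()])   # e.g. ['/device:CPU:0', '/device:GPU:0', ...]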
When I profile the code, these are the top 10 lines on the GPU:
10852017 function calls (10524203 primitive calls) in 184.768 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
16200 173.827 0.011 173.827 0.011 {built-in method _pywrap_tensorflow_internal.TF_SessionRunCallable}
6 0.926 0.154 0.926 0.154 {built-in method _pywrap_tensorflow_internal.TF_SessionMakeCallable}
62 0.813 0.013 0.813 0.013 {built-in method _pywrap_tensorflow_internal.TF_SessionRun_wrapper}
156954 0.414 0.000 0.415 0.000 {built-in method numpy.array}
16200 0.379 0.000 1.042 0.000 training.py:643(_standardize_user_data)
24300 0.338 0.000 0.338 0.000 {method 'partition' of 'numpy.ndarray' objects}
68 0.301 0.004 0.301 0.004 {built-in method _pywrap_tensorflow_internal.ExtendSession}
32458 0.223 0.000 2.122 0.000 tensorflow_backend.py:156(get_session)
3206 0.212 0.000 0.238 0.000 tf_stack.py:31(extract_stack)
76024 0.210 0.000 0.702 0.000 ops.py:5246(get_controller)
...
On the CPU, these are the top 10 lines:
22123473 function calls (21647174 primitive calls) in 60.173 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
16269 42.491 0.003 42.491 0.003 {built-in method tensorflow.python._pywrap_tensorflow_internal.TF_Run}
16269 0.568 0.000 48.964 0.003 session.py:1042(_run)
56 0.532 0.010 0.532 0.010 {built-in method time.sleep}
153641 0.458 0.000 0.460 0.000 {built-in method numpy.core.multiarray.array}
183148/125354 0.447 0.000 1.316 0.000 python_message.py:469(init)
1226659 0.362 0.000 0.364 0.000 {built-in method builtins.getattr}
2302110/2301986 0.339 0.000 0.358 0.000 {built-in method builtins.isinstance}
8 0.285 0.036 0.285 0.036 {built-in method tensorflow.python._pywrap_tensorflow_internal.TF_ExtendGraph}
12150 0.267 0.000 0.271 0.000 callbacks.py:211(on_batch_end)
147026/49078 0.264 0.000 1.429 0.000 python_message.py:1008(ByteSize)
...
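Both listings are cProfile output sorted by internal time; a minimal sketch of one way such a profile can be collected (run_training and the .prof filename are illustrative placeholders, not my actual script):
import cProfile
import pstats

def run_training():
    # placeholder: build the model and call fit_generator as in the code below
    pass

cProfile.run('run_training()', 'train.prof')
pstats.Stats('train.prof').sort_stats('tottime').print_stats(10)  # top 10 by internal time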
Here is my code:
import numpy as np
import tensorflow as tf
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.callbacks import ModelCheckpoint
from matplotlib import pyplot

# train_x, train_y: lists of (timesteps, 24) arrays and scalar targets, prepared elsewhere

def train_generator(x_list, y_list):
    # 0.1 validation split: iterate over the first 90% of the data, one sample per yield
    train_length = (len(x_list)//10)*9
    while True:
        for i in range(train_length):
            train_x = np.array([x_list[i]])
            train_y = np.array([y_list[i]])
            yield train_x, train_y

def val_generator(x_list, y_list):
    # 0.1 validation split: iterate over the last 10% of the data, one sample per yield
    val_length = len(x_list)//10
    while True:
        for i in range(-val_length, 0, 1):
            val_x = np.array([x_list[i]])
            val_y = np.array([y_list[i]])
            yield val_x, val_y

with tf.Session(config=tf.ConfigProto(log_device_placement=True)):
    model = Sequential()
    model.add(LSTM(64, return_sequences=False,
                   input_shape=(None, 24)))
    model.add(Dense(1))
    model.compile(loss='mae', optimizer='adam')

    checkpointer = ModelCheckpoint(filepath="weights.hdf5",
                                   monitor='val_loss', verbose=1,
                                   save_best_only=True)

    history = model.fit_generator(generator=train_generator(train_x, train_y),
                                  steps_per_epoch=(len(train_x)//10)*9,
                                  epochs=5,
                                  validation_data=val_generator(train_x, train_y),
                                  validation_steps=len(train_x)//10,
                                  callbacks=[checkpointer],
                                  verbose=2, shuffle=False)

    # plot training and validation loss
    pyplot.plot(history.history['loss'], label='train')
    pyplot.plot(history.history['val_loss'], label='validation')
    pyplot.legend()
    pyplot.show()
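For context on the data pipeline, each generator step yields a single sample (batch size 1); a minimal shape check on dummy data with the same layout (the sequence length and list size are made up for illustration, only the 24 features per timestep match my data):
import numpy as np

# dummy data: 100 sequences of 50 timesteps x 24 features, with scalar targets
x_list = [np.random.rand(50, 24) for _ in range(100)]
y_list = [np.random.rand() for _ in range(100)]

batch_x, batch_y = next(train_generator(x_list, y_list))
print(batch_x.shape, batch_y.shape)  # (1, 50, 24) (1,)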
I expected a significant speedup when training on the GPU. How can I fix this? Can anyone help me understand what is causing the slowdown? Thank you.