
I have a problem running TensorFlow computations on 2 GPUs connected with an SLI bridge: only one of them does any work, the second one does not, even though both GPUs are recognized by TF.

Setup:

  - Ubuntu 18.04
  - Python 3
  - Tensorflow 2.1
  - Cuda 10.1
  - Nvidia driver (official) 440.64
  - AMD Ryzen 2700
  - Asus X470 Prime
  - two GTX 1070 GPUs connected with an SLI bridge
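
For reference, this is roughly how I confirm that both cards are visible to TensorFlow before running anything (a minimal sketch; the experimental aliases are what TF 2.1 exposes):

import tensorflow as tf

# Physical GPUs as seen by TF; on my machine this lists both GTX 1070 cards.
gpus = tf.config.experimental.list_physical_devices('GPU')
print("Num GPUs Available:", len(gpus))
for gpu in gpus:
    print(gpu)

# Logical devices as TF addresses them: '/GPU:0' and '/GPU:1'.
print(tf.config.experimental.list_logical_devices('GPU'))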

I have already tried many of the things I found on the internet. Specifically:

  1. I started with Tensorflow 2.0; it did not work, so I updated to TF 2.1. Same problem.

  2. I purged and reinstalled the Nvidia 430.50 drivers, then updated them to 440.64. Same problem.

  3. I verified each of my GPUs separately: I physically removed one of them and ran the code on the remaining one. It worked, so the GPUs themselves seem fine.

  4. I verified each GPU slot on the motherboard separately. It worked, which means each slot is fine.

  5. I installed both GPUs, with and without the hardware SLI bridge, and ran the following code:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.applications import Xception
import numpy as np

num_samples = 100
height = 224
width = 224
num_classes = 50

strategy = tf.distribute.MirroredStrategy(devices=['/GPU:0', '/GPU:1'])
with strategy.scope():
    parallel_model = Xception(weights=None,
                              input_shape=(height, width, 3),
                              classes=num_classes)
    parallel_model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

### Works only for the first of the two GPUs (single-GPU baseline)
# parallel_model = Xception(weights=None,
#                           input_shape=(height, width, 3),
#                           classes=num_classes)
# parallel_model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

# Generate dummy data.
x = np.random.random((num_samples, height, width, 3))
y = np.random.random((num_samples, num_classes))

parallel_model.summary()
# This `fit` call should be distributed across the 2 GPUs.
# Since the batch size is 16, each GPU should process 8 samples per step.
parallel_model.fit(x, y, epochs=20, batch_size=16)
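
As a side note on the batch split: the per-replica batch can be derived directly from the strategy (a minimal sketch, assuming the two-GPU strategy object created in the code above):

# With 2 replicas and a global batch size of 16, each GPU should process 8 samples per step.
print("Replicas in sync:", strategy.num_replicas_in_sync)
per_replica_batch = 16 // strategy.num_replicas_in_sync
print("Per-replica batch size:", per_replica_batch)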

As a result, with strategy = tf.distribute.MirroredStrategy(devices=['/GPU:0']) the code runs fine. However, with devices=['/GPU:1'] or devices=['/GPU:0', '/GPU:1'], nvidia-smi shows some processes on the second GPU, but the execution hangs right after printing the lines

2020-03-28 21:51:14.891325: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7162 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:08:00.0, compute capability: 6.1)
2020-03-28 21:51:14.891805: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-28 21:51:14.892399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7624 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1070, pci bus id: 0000:09:00.0, compute capability: 6.1)

So I have to restart the computer, because it is frozen.
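
For completeness, there is also a software-only way to test the second card in isolation, without physically removing the first one (a minimal sketch; set_visible_devices must be called before TF initializes any GPU, and launching with CUDA_VISIBLE_DEVICES=1 should be equivalent):

import tensorflow as tf

# Hide GPU:0 so the second physical card becomes the only device TF initializes.
physical_gpus = tf.config.experimental.list_physical_devices('GPU')
if len(physical_gpus) > 1:
    tf.config.experimental.set_visible_devices(physical_gpus[1], 'GPU')

logical_gpus = tf.config.experimental.list_logical_devices('GPU')
print("Visible logical GPUs:", logical_gpus)  # the remaining card now appears as /GPU:0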

  6. Initially, my X11 configuration (xorg.conf) was not configured for SLI:
Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
EndSection

Section "Device"
    Identifier     "Device1"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

After googling, I played with sudo nvidia-xconfig -sli=on, sudo nvidia-xconfig -sli=auto, etc.

As a result, after rebooting, I got a boot loop showing 2 lines:

recovering journal
/dev/nvme0n1p2: clean, XXX/XXX files, XXX/XXX blocks

Every ~3 seconds the screen goes black, then these two lines appear again. There is no access to a TTY either, since it is also stuck in the boot loop. I looked at everything I could find on this topic, but nothing worked. So I kept my previous X11 configuration without SLI.

If you have run into this kind of problem, please feel free to share. Any suggestion would help.

Thanks!
