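Our university just got a new cluster, and I am trying to train a multi-node, multi-GPU neural network on it. I am seeing strange `gpu_mut` and `gpu_mused` utilization values that are inconsistent with each other.

[Image showing the utilization values]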

To figure out whether the problem is in my code or in the hardware, I tried running Horovod's PyTorch benchmark example, but I got:

RuntimeError: CUDA out of memory. Tried to allocate 14.00 MiB (GPU 1; 15.78 GiB total capacity; 2.68 GiB already allocated; 53.75 MiB free; 77.13 MiB cached; 0 bytes inactive)
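The numbers in that message can be broken down with a little arithmetic. The following is a minimal sketch, not a definitive diagnosis: `parse_oom` and `unaccounted_gib` are hypothetical helper names, and the sketch assumes that the "cached" figure is reported on top of "already allocated". It estimates how much of GPU 1 is neither free nor held by this process, which may point to other processes (or CUDA context overhead) occupying the device.

```python
import re

# The error line quoted above, copied verbatim.
MESSAGE = (
    "RuntimeError: CUDA out of memory. Tried to allocate 14.00 MiB "
    "(GPU 1; 15.78 GiB total capacity; 2.68 GiB already allocated; "
    "53.75 MiB free; 77.13 MiB cached; 0 bytes inactive)"
)

_UNITS = {"bytes": 1, "KiB": 1024, "MiB": 1024 ** 2, "GiB": 1024 ** 3}
_PATTERN = re.compile(
    r"([\d.]+) (bytes|KiB|MiB|GiB) (total capacity|already allocated|free|cached)"
)


def parse_oom(message):
    """Return the sizes named in the OOM message, converted to bytes."""
    return {
        label: float(value) * _UNITS[unit]
        for value, unit, label in _PATTERN.findall(message)
    }


def unaccounted_gib(fields):
    """Memory that is neither free nor attributed to this process, in GiB.

    Assumption: "cached" is counted in addition to "already allocated".
    """
    used_by_this_process = fields["already allocated"] + fields["cached"]
    gap = fields["total capacity"] - fields["free"] - used_by_this_process
    return gap / _UNITS["GiB"]


if __name__ == "__main__":
    fields = parse_oom(MESSAGE)
    print(f"unaccounted: {unaccounted_gib(fields):.2f} GiB")
```

If the gap is large compared with what this process allocated, comparing it against the per-process listing from `nvidia-smi` on the same node would show whether another rank or another job is holding the rest.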

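My university's IT support team says that they also see the inconsistent GPU usage, but that the error is not on their side. Can anyone tell me how to approach this, so that I can determine whether the error is in my code or in the cluster's hardware setup? The environment is as follows.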

**Environment:**
Framework: PyTorch
Framework version: 1.3.1
Horovod version: 0.19.0
MPI version: 10.3.0.01rc04
CUDA version: 10.1.105
NCCL version: 2.5.6
Python version: 3.7.9
OS and version: Red Hat Enterprise Linux Server 7.6 (Maipo)
GCC version: 4.8.5 20150623 (Red Hat 4.8.5-36)

Also, this is an IBM Power 9 cluster.
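One way to separate a code problem from a placement problem might be to have every rank report which GPU it actually uses, then check whether two ranks landed on the same device. The sketch below only covers the checking step. `find_shared_gpus` is a hypothetical helper, and the sample placements are made up for illustration, not measured on this cluster. In a real run, each rank could report `(hvd.rank(), socket.gethostname(), torch.cuda.current_device())` after calling `hvd.init()` and `torch.cuda.set_device(hvd.local_rank())`.

```python
from collections import defaultdict


def find_shared_gpus(placements):
    """Group ranks by (hostname, device index) and return the shared GPUs.

    placements: iterable of (rank, hostname, device_index) tuples.
    Returns a dict mapping (hostname, device_index) to the sorted list of
    ranks, restricted to GPUs that more than one rank is using.
    """
    ranks_by_gpu = defaultdict(list)
    for rank, hostname, device_index in placements:
        ranks_by_gpu[(hostname, device_index)].append(rank)
    return {
        gpu: sorted(ranks)
        for gpu, ranks in ranks_by_gpu.items()
        if len(ranks) > 1
    }


if __name__ == "__main__":
    # Hypothetical reports: ranks 0 and 1 both ended up on GPU 0 of one host.
    sample = [
        (0, "t035", 0),
        (1, "t035", 0),
        (2, "t055", 0),
        (3, "t055", 1),
    ]
    for (hostname, device_index), ranks in find_shared_gpus(sample).items():
        print(f"{hostname} GPU {device_index} is shared by ranks {ranks}")
```

An empty result would suggest that each rank has its own GPU, in which case memory held by other jobs on the same node becomes the more likely thing to ask the IT team about.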

Full error log:

[WARN DDL-2-17] Not performing connection tests. Cannot find 'mpitool' executable. This could be because you are using a version of mpi that does not ship with mpitool.

Please see /tmp/DDLRUN/DDLRUN.ETfOvOUTgv5U/ddlrun.log for detailed log.
+ /home/rxs1576/.conda/envs/opt/bin/mpirun -x LD_LIBRARY_PATH -x LSB_JOBID -x LSB_MCPU_HOSTS -x PATH -disable_gdr -gpu -mca plm_rsh_num_concurrent 4 --rankfile /tmp/DDLRUN/DDLRUN.ETfOvOUTgv5U/RANKFILE -n 8 -x DDL_HOST_PORT=2200 -x "DDL_HOST_LIST=t035:0,1;t055:2,3;t056:4,5;t082:6,7" -x "DDL_OPTIONS=-mode p:2x4x1x1 " bash -c 'source /share/apps/ibm_wml_ce/1.6.2/anaconda3/etc/profile.d/conda.sh && conda activate /home/rxs1576/.conda/envs/opt > /dev/null 2>&1 && /home/rxs1576/latest_scripts/launch.sh python /home/rxs1576/latest_scripts/benchmark_pytorch.py'
Traceback (most recent call last):
  File "/home/rxs1576/latest_scripts/benchmark_pytorch.py", line 102, in <module>
    timeit.timeit(benchmark_step, number=args.num_warmup_batches)
  File "/home/rxs1576/.conda/envs/opt/lib/python3.7/timeit.py", line 233, in timeit
    return Timer(stmt, setup, timer, globals).timeit(number)
  File "/home/rxs1576/.conda/envs/opt/lib/python3.7/timeit.py", line 177, in timeit
    timing = self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner
  File "/home/rxs1576/latest_scripts/benchmark_pytorch.py", line 83, in benchmark_step
    output = model(data)
  File "/home/rxs1576/.conda/envs/opt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rxs1576/.conda/envs/opt/lib/python3.7/site-packages/torchvision/models/resnet.py", line 204, in forward
    x = self.layer4(x)
  File "/home/rxs1576/.conda/envs/opt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rxs1576/.conda/envs/opt/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/rxs1576/.conda/envs/opt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rxs1576/.conda/envs/opt/lib/python3.7/site-packages/torchvision/models/resnet.py", line 110, in forward
    identity = self.downsample(x)
  File "/home/rxs1576/.conda/envs/opt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rxs1576/.conda/envs/opt/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/rxs1576/.conda/envs/opt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rxs1576/.conda/envs/opt/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/home/rxs1576/.conda/envs/opt/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 14.00 MiB (GPU 1; 15.78 GiB total capacity; 2.68 GiB already allocated; 53.75 MiB free; 77.13 MiB cached; 0 bytes inactive)
Traceback (most recent call last):
  File "/home/rxs1576/latest_scripts/benchmark_pytorch.py", line 102, in <module>
    timeit.timeit(benchmark_step, number=args.num_warmup_batches)
  File "/home/rxs1576/.conda/envs/opt/lib/python3.7/timeit.py", line 233, in timeit
    return Timer(stmt, setup, timer, globals).timeit(number)
  File "/home/rxs1576/.conda/envs/opt/lib/python3.7/timeit.py", line 177, in timeit
    timing = self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner
  File "/home/rxs1576/latest_scripts/benchmark_pytorch.py", line 83, in benchmark_step
    output = model(data)
  File "/home/rxs1576/.conda/envs/opt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rxs1576/.conda/envs/opt/lib/python3.7/site-packages/torchvision/models/resnet.py", line 201, in forward
    x = self.layer1(x)
  File "/home/rxs1576/.conda/envs/opt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rxs1576/.conda/envs/opt/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/rxs1576/.conda/envs/opt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rxs1576/.conda/envs/opt/lib/python3.7/site-packages/torchvision/models/resnet.py", line 106, in forward
    out = self.conv3(out)
  File "/home/rxs1576/.conda/envs/opt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
"TEST_350029.err" 197L, 16215C                                                                                                                                                            1,1           Top


