tensorflow - How do I debug TensorFlow segfaulting in nvidia-docker?

Question

I'm on Ubuntu 18.04, running the container interactively like this:

docker run --runtime=nvidia -it --rm -v $PWD:/root/stuff -w /root tensorflow/tensorflow:latest-gpu-py3 bash

Oddly, I don't get the segfault when I run non-interactively, i.e. docker run ... python stuff/mnist.py

NVIDIA details:

$ nvidia-smi
Thu Nov 29 22:09:25 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.18       Driver Version: 415.18       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2070    Off  | 00000000:01:00.0  On |                  N/A |
| 30%   32C    P8    11W / 175W |    358MiB /  7949MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1471      G   /usr/lib/xorg/Xorg                            18MiB |
|    0      1523      G   /usr/bin/gnome-shell                          50MiB |
|    0      1919      G   /usr/lib/xorg/Xorg                           129MiB |
|    0      2063      G   /usr/bin/gnome-shell                         114MiB |
|    0      3762      G   ...quest-channel-token=2440404091774701506    43MiB |
+-----------------------------------------------------------------------------+



root@4a46cc9acb73:~# python -X faulthandler -vv stuff/mnist.py
Train on 60000 samples, validate on 10000 samples
Epoch 1/15
2018-11-29 22:06:26.371579: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-29 22:06:26.500120: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-29 22:06:26.500670: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.62
pciBusID: 0000:01:00.0
totalMemory: 7.76GiB freeMemory: 7.29GiB
2018-11-29 22:06:26.500686: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-11-29 22:06:26.723360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-29 22:06:26.723400: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2018-11-29 22:06:26.723407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2018-11-29 22:06:26.723859: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7015 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
Fatal Python error: Segmentation fault

Thread 0x00007f82a1277700 (most recent call first):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1439 in __call__
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/backend.py", line 2986 in __call__
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/training_arrays.py", line 215 in fit_loop
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/training.py", line 1639 in fit
  File "stuff/mnist.py", line 36 in <module>
Segmentation fault (core dumped)
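
The traceback above comes from the `-X faulthandler` interpreter flag. When the flag cannot be passed (for example, if the script is launched through a wrapper), the same dump can be enabled from inside the script. This is only a minimal sketch, assuming it is placed at the top of mnist.py before TensorFlow is imported; `enable_crash_tracebacks` is a hypothetical helper name, not part of any library. Note that faulthandler only reports Python frames, so it shows where the script was when the crash happened, not which native function faulted.

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=sys.stderr):
    """Hypothetical helper: make fatal signals (SIGSEGV, SIGABRT, ...) dump
    the Python traceback of every thread to `stream` before the process dies."""
    if not faulthandler.is_enabled():
        # all_threads=True dumps every thread, not just the one that crashed.
        faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


# Call this before importing TensorFlow so that crashes during import
# or session setup are covered as well.
enable_crash_tracebacks()
```

Setting the environment variable PYTHONFAULTHANDLER=1 (for instance with `docker run -e PYTHONFAULTHANDLER=1 ...`) should have the same effect without touching the script.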

