
I am new to TensorFlow.

I just installed TensorFlow and, to test the installation, I tried the following code. As soon as I start a TF session, I get a segmentation fault (core dumped) error.

bafhf@remote-server:~$ python
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
/home/bafhf/anaconda3/envs/ismll/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
>>> tf.Session()
2018-05-15 12:04:15.461361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1349] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:04:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
Segmentation fault (core dumped)

My nvidia-smi output is:

Tue May 15 12:12:26 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:04:00.0 Off |                    0 |
| N/A   38C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:05:00.0 Off |                    2 |
| N/A   31C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

nvcc --version is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

and gcc --version is:

gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

The following is my PATH:

/home/bafhf/bin:/home/bafhf/.local/bin:/usr/local/cuda/bin:/usr/local/cuda/lib:/usr/local/cuda/extras/CUPTI/lib:/home/bafhf/anaconda3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

and LD_LIBRARY_PATH:

/usr/local/cuda/bin:/usr/local/cuda/lib:/usr/local/cuda/extras/CUPTI/lib


I am running this on a server where I do not have root access; nevertheless, I installed everything following the instructions on the official website.

Edit: New observation:

It seems that the GPU allocates memory for the process for a second, and then the segmentation fault (core dumped) error is thrown:

Terminal output (screenshot)

Edit 2: Changed the TensorFlow version

I downgraded my TensorFlow version from v1.8 to v1.5. The problem still persists.


Is there a way to fix or debug this issue?


7 Answers


This can happen because you are using multiple GPUs here. Try setting CUDA visible devices to just one of the GPUs. See this link for instructions on how to do that. In my case, this solved the problem.
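A minimal sketch of that approach: `CUDA_VISIBLE_DEVICES` must be set before TensorFlow initializes CUDA, so set it at the very top of the script (the device index `0` here is illustrative; pick whichever GPU you want to expose):

```python
import os

# Hide all but the first GPU from CUDA-aware libraries. This must run
# before `import tensorflow`, because TF enumerates devices on import.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# import tensorflow as tf   # imported only after the variable is set
# tf.Session()              # now sees a single GPU
```

The same effect can be had from the shell with `CUDA_VISIBLE_DEVICES=0 python script.py`.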

Answered 2018-05-22T09:51:30.697

If you look at the nvidia-smi output, the second GPU has an ECC code of 2. This error manifests itself regardless of the CUDA or TF version, usually as a segfault, and sometimes with the CUDA_ERROR_ECC_UNCORRECTABLE flag in the stack trace.

I reached this conclusion from this post:

"Uncorrectable ECC error" typically refers to a hardware failure. ECC is Error Correcting Code, a means to detect and correct errors in bits stored in RAM. A stray cosmic ray can disrupt one bit stored in RAM every once in a great while, but an "uncorrectable ECC error" indicates that several bits are coming out of RAM storage "wrong" - too many for the ECC to recover the original bit values.

This could mean that there is a bad or marginal RAM cell in your GPU device memory.

Marginal circuits of any kind may not fail 100%, but are more likely to fail under the stress of heavy use - and the associated rise in temperature.

A reboot should normally clear the ECC error. If it does not, the only option seems to be to replace the hardware.


So what did I do, and how did I finally fix the issue?

  1. I tested my code on a separate machine with an NVIDIA 1050 Ti, and there it executed perfectly fine.
  2. I ran the code only on the first card, whose ECC value was fine, just to narrow down the issue. I did this by setting the CUDA_VISIBLE_DEVICES environment variable, following this post.
  3. Then I requested a restart of the Tesla K80 server to check whether a reboot could clear the ECC error. It took them a while, but the server was then restarted.

    Now the problem no longer exists and I can run both cards for my TensorFlow implementations.

Answered 2018-06-22T07:56:53.547

In case anyone is still interested: I happened to run into the same issue, with the "Volatile Uncorr. ECC" output. My problem was a version incompatibility, as shown below:

Loaded runtime CuDNN library: 7.1.1 but source was compiled with: 7.2.1. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration. Segmentation fault
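The compatibility rule stated in that error can be written down as a small check: the major versions must match, and the runtime minor version must be at least the one the binary was compiled against. The function name below is my own, not a TensorFlow API:

```python
def cudnn_runtime_compatible(runtime, compiled):
    """True if a runtime cuDNN version can serve a binary compiled
    against `compiled`, per the rule in the error message above
    (cuDNN 7.0+). Versions are (major, minor, patch) tuples."""
    return runtime[0] == compiled[0] and runtime[1] >= compiled[1]

print(cudnn_runtime_compatible((7, 1, 1), (7, 2, 1)))  # False: the segfault case
print(cudnn_runtime_compatible((7, 3, 1), (7, 2, 1)))  # True: after the upgrade
```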

After I upgraded the CuDNN library to 7.3.1 (greater than 7.2.1), the segmentation fault disappeared. To upgrade, I did the following (also documented here).

  1. Download the CuDNN library from the NVIDIA website
  2. sudo tar -xzvf [TAR_FILE]
  3. sudo cp cuda/include/cudnn.h /usr/local/cuda/include
  4. sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
  5. sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
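To confirm which version the copied header actually declares, you can read the version macros out of cudnn.h. This is a sketch; the path in the commented usage assumes the standard /usr/local/cuda layout, and the helper name is my own:

```python
import re

def cudnn_version(header_text):
    """Extract "major.minor.patch" from cudnn.h-style version macros
    (#define CUDNN_MAJOR 7, etc.)."""
    vals = {}
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        m = re.search(r"#define\s+%s\s+(\d+)" % name, header_text)
        vals[name] = m.group(1) if m else "?"
    return "{CUDNN_MAJOR}.{CUDNN_MINOR}.{CUDNN_PATCHLEVEL}".format(**vals)

# Usage (adjust the path to your install):
# with open("/usr/local/cuda/include/cudnn.h") as f:
#     print(cudnn_version(f.read()))
```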
Answered 2018-10-11T14:38:52.750

I was also facing the same issue. I have a workaround you can try.

I followed these steps:
  1. Reinstall Python 3.5 or above
  2. Reinstall CUDA and add the cuDNN libraries to it
  3. Reinstall TensorFlow 1.8.0, GPU version

Answered 2019-11-12T05:53:59.483

I ran into this problem recently.

The reason is multiple GPUs in a docker container. The solution is pretty simple; you can either:

set CUDA_VISIBLE_DEVICES in the host, see https://stackoverflow.com/a/50464695/2091555

or

use --ipc=host to launch docker if you need multiple GPUs, e.g.

docker run --runtime=nvidia --ipc=host \
  --rm -it \
  nvidia/cuda:10.0-cudnn7-runtime-ubuntu16.04

This issue is actually quite nasty: the segfault happens during the cuInit() call in the docker container, while everything works fine in the host. I will leave the logs here so that search engines can find this answer more easily for other people.

(base) root@e121c445c1eb:~# conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
Collecting package metadata (current_repodata.json): / Segmentation fault (core dumped)

(base) root@e121c445c1eb:~# gdb python /data/corefiles/core.conda.572.1569384636
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.

warning: core file may not match specified executable file.
[New LWP 572]
[New LWP 576]

warning: Unexpected size of section `.reg-xstate/572' in core file.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/opt/conda/bin/python /opt/conda/bin/conda upgrade conda'.
Program terminated with signal SIGSEGV, Segmentation fault.

warning: Unexpected size of section `.reg-xstate/572' in core file.
#0  0x00007f829f0a55fb in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
[Current thread is 1 (Thread 0x7f82bbfd7700 (LWP 572))]
(gdb) bt
#0  0x00007f829f0a55fb in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#1  0x00007f829f06e3a5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#2  0x00007f829f07002c in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#3  0x00007f829f0e04f7 in cuInit () from /usr/lib/x86_64-linux-gnu/libcuda.so
#4  0x00007f82b99a1ec0 in ffi_call_unix64 () from /opt/conda/lib/python3.7/lib-dynload/../../libffi.so.6
#5  0x00007f82b99a187d in ffi_call () from /opt/conda/lib/python3.7/lib-dynload/../../libffi.so.6
#6  0x00007f82b9bb7f7e in _call_function_pointer (argcount=1, resmem=0x7ffded858980, restype=<optimized out>, atypes=0x7ffded858940, avalues=0x7ffded858960, pProc=0x7f829f0e0380 <cuInit>, 
    flags=4353) at /usr/local/src/conda/python-3.7.3/Modules/_ctypes/callproc.c:827
#7  _ctypes_callproc () at /usr/local/src/conda/python-3.7.3/Modules/_ctypes/callproc.c:1184
#8  0x00007f82b9bb89b4 in PyCFuncPtr_call () at /usr/local/src/conda/python-3.7.3/Modules/_ctypes/_ctypes.c:3969
#9  0x000055c05db9bd2b in _PyObject_FastCallKeywords () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:199
#10 0x000055c05dbf7026 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:4619
#11 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:3124
#12 0x000055c05db9a79b in function_code_fastcall (globals=<optimized out>, nargs=0, args=<optimized out>, co=<optimized out>)
    at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:283
#13 _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:408
#14 0x000055c05dbf2846 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:4616
#15 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:3124
... (stack omitted)
#46 0x000055c05db9aa27 in _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:433
---Type <return> to continue, or q <return> to quit---q
Quit

Another attempt was to install with pip:

(base) root@e121c445c1eb:~# pip install torch torchvision
(base) root@e121c445c1eb:~# python
Python 3.7.3 (default, Mar 27 2019, 22:11:17) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
Segmentation fault (core dumped)

(base) root@e121c445c1eb:~# gdb python /data/corefiles/core.python.28.1569385311 
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.

warning: core file may not match specified executable file.
[New LWP 28]

warning: Unexpected size of section `.reg-xstate/28' in core file.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
bt
Core was generated by `python'.
Program terminated with signal SIGSEGV, Segmentation fault.

warning: Unexpected size of section `.reg-xstate/28' in core file.
#0  0x00007ffaa1d995fb in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
(gdb) bt
#0  0x00007ffaa1d995fb in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#1  0x00007ffaa1d623a5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007ffaa1d6402c in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007ffaa1dd44f7 in cuInit () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007ffaee75f724 in cudart::globalState::loadDriverInternal() () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#5  0x00007ffaee760643 in cudart::__loadDriverInternalUtil() () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#6  0x00007ffafe2cda99 in __pthread_once_slow (once_control=0x7ffaeebe2cb0 <cudart::globalState::loadDriver()::loadDriverControl>, 
... (stack omitted)
Answered 2019-09-25T08:00:58.680

Check that you are using the exact CUDA and CuDNN versions required by TensorFlow, and that you are using the graphics-driver version that ships with that CUDA version.

I once had a similar issue where the driver was too new. Downgrading it to the version shipped with the CUDA release required by TensorFlow solved the problem for me.

Answered 2018-05-22T21:07:53.807

I was using TensorFlow in a Paperspace cloud environment.

The update to cuDNN 7.3.1 did not work for me.

One approach is to build TensorFlow with proper GPU and CPU support.

This is not a proper solution, but it temporarily solved my problem (downgrading TensorFlow to 1.5.0):

pip uninstall tensorflow-gpu
pip install tensorflow==1.5.0
pip install numpy==1.14.0
pip install six==1.10.0
pip install joblib==0.12

Hope this helps!

于 2018-12-25T12:25:40.707 回答