python - 在 CentOS 7 上使用 Tensorflow 的零星核心转储（分段错误、非法指令）

Question

在使用 tensorflow 训练模型时，我经常遇到核心转储（非法指令、分段错误）。它们有些零星，但随着模型架构变得更加复杂（更多节点，更多层），频率似乎会增加。

我得到了以下设置：
CentOS 7 CUDA Tooklit 版本 8
cuDNN 版本 5.1
tensorflow-gpu 版本 1.0.0 由 pip 安装

所有的环境路径都设置好了，tensorflow 似乎可以识别并拾取 GPU、CUDA 和必要的库......

import tensorflow as tf  
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally

第一个错误发生在我尝试构建多层深度网络时，几乎每次都会失败。所以我从头开始学习 tensorflow 教程并尝试了一些更简单的东西，这些东西似乎有效......但并非总是如此。

因此，作为一个小型实验，我从 tensorflows 网站上的 MNIST 数据教程中选取了两个复杂度不同的模型，并稍作修改。一个是简单的 softmax 回归模型，保存为softmax.py如下所示：

from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
config = tf.ConfigProto()
config.gpu_options.allow_growth=True
sess = tf.InteractiveSession(config=config)
x = tf.placeholder(tf.float32, shape=[None, 784])
y_ = tf.placeholder(tf.float32, shape=[None, 10])
W = tf.Variable(tf.zeros([784,10]))
b = tf.Variable(tf.zeros([10]))
sess.run(tf.global_variables_initializer())
y = tf.matmul(x,W) + b
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
for _ in range(1000):
    batch = mnist.train.next_batch(100)
    train_step.run(feed_dict={x: batch[0], y_: batch[1]})
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

第二个文件multiconv.py如下所示：

import tensorflow as tf

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

x = tf.placeholder(tf.float32, shape=[None, 784])
y_ = tf.placeholder(tf.float32, shape=[None, 10])


def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                    strides=[1, 2, 2, 1], padding='SAME')

W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])

x_image = tf.reshape(x, [-1, 28, 28, 1])

h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)
W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])

h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])

h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])

y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2

cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

config = tf.ConfigProto()
config.gpu_options.allow_growth=True

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(20000):
        batch = mnist.train.next_batch(50)
        if i % 100 == 0:
            train_accuracy = accuracy.eval(feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0})
            print('step %d, training accuracy %g' % (i, train_accuracy))
        train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})
    print('test accuracy %g' % accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))

如果我同时运行这两个程序 100 次

$cmd="python softmax.py"; for i in $(seq 100); do $cmd &>> temp.txt; sleep 1; done
Illegal instruction (core dumped)
Illegal instruction (core dumped)
Segmentation fault (core dumped)

和

$cmd="python multiconv.py"; for i in $(seq 100); do $cmd &>> temp.txt; sleep 1; done
Segmentation fault (core dumped)
Illegal instruction (core dumped)
Illegal instruction (core dumped)
Segmentation fault (core dumped)
Segmentation fault (core dumped)
Segmentation fault (core dumped)
Segmentation fault (core dumped)
Segmentation fault (core dumped)
Segmentation fault (core dumped)
Segmentation fault (core dumped)

因此，模型越复杂，它发生的频率就越高。

我相信我已经排除了内存问题作为潜在错误，因为我使用 nvidia-smi 命令观察内存输出并且它保持相当稳定。


+-----------------------------------------------------------------------------+
| NVIDIA-SMI 381.22                 Driver Version: 381.22                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 0000:02:00.0     Off |                  N/A |
| 30%   52C    P2    61W / 250W |    413MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 0000:81:00.0     Off |                  N/A |
| 27%   47C    P8    18W / 250W |    161MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     31956    C   python                                         403MiB |
|    1     31956    C   python                                         151MiB |
+-----------------------------------------------------------------------------+

我已经使用 gdb 捕获了一个错误，输出是这样的

0x00007fff8d3fbb90 in ?? () from /lib64/libcuda.so.1

如果需要，我可以提供完整的回溯。

有没有人有任何想法我可以开始进一步解决这个问题？

score 1 · Accepted Answer

看起来问题必须处理numpy。在 tensorflow-gpu pip 包安装的版本上使用 pip 重新安装 numpy 似乎可以修复它。

编辑：进一步的调查使我相信它来自于在 tensorflow 之后安装 scikit-learn 的冲突。numpy 版本会导致冲突。

python - 在 CentOS 7 上使用 Tensorflow 的零星核心转储（分段错误、非法指令）

1 回答 1

Related

Reference