I have been tracking down a SEGFAULT in TensorFlow. The problem can be reproduced with the following snippet:
import tensorflow as tf

with tf.device('/cpu:0'):
    xin = tf.placeholder(tf.float32, [None, 1, 1], name='input')
    rnn_cell = tf.contrib.rnn.LSTMCell(1)
    out, _ = tf.nn.dynamic_rnn(rnn_cell, xin, dtype=tf.float32)
    out = tf.layers.batch_normalization(out, training=True)
    out = tf.identity(out, name='output')

    optimiser = tf.train.AdamOptimizer(.0001)
    # batch_normalization registers its moving-average updates in UPDATE_OPS,
    # which the documentation says must run together with the train op.
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        out = optimiser.minimize(out, global_step=tf.Variable(0, dtype=tf.float32), name='train_op')

config = tf.ConfigProto(allow_soft_placement=False)
sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())

sample_in = [[[0]]]
sess.run(out, feed_dict={xin: sample_in})
I have managed to track the problem down, and I have a pull request open on GitHub. If you run this code with my patch applied, you get the following error message:
2018-04-03 13:09:24.326950: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:65:00.0
totalMemory: 11.90GiB freeMemory: 11.74GiB
2018-04-03 13:09:24.326982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-04-03 13:09:24.512956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11366 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:65:00.0, compute capability: 6.1)
Traceback (most recent call last):
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1361, in _do_call
    return fn(*args)
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
    target_list, status, run_metadata)
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Cycle detected when adding enter->frame edge: Edge from gradients/f_count to (null) would create a cycle.
+-> (null)
| rnn/TensorArrayStack/TensorArrayGatherV3
| rnn/transpose_1
| batch_normalization/moments/mean
| batch_normalization/moments/Squeeze
| batch_normalization/AssignMovingAvg/sub
| batch_normalization/AssignMovingAvg/mul
| batch_normalization/AssignMovingAvg
+-- gradients/f_count
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "breakage.py", line 21, in <module>
    sess.run(out, feed_dict={xin: sample_in})
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Cycle detected when adding enter->frame edge: Edge from gradients/f_count to (null) would create a cycle.
+-> (null)
| rnn/TensorArrayStack/TensorArrayGatherV3
| rnn/transpose_1
| batch_normalization/moments/mean
| batch_normalization/moments/Squeeze
| batch_normalization/AssignMovingAvg/sub
| batch_normalization/AssignMovingAvg/mul
| batch_normalization/AssignMovingAvg
+-- gradients/f_count
This seems to point to a topological problem in my example code. The problem appears to occur whenever I combine any kind of RNN with batch normalization and the additional control dependencies that batch normalization requires.
I have managed to mitigate the issue by relying on tf.contrib.layers.batch_norm and setting its updates_collections argument to None, which inlines the update operations instead of collecting them.
For reference, here is the updated code example:
import tensorflow as tf

with tf.device('/cpu:0'):
    xin = tf.placeholder(tf.float32, [None, 1, 1], name='input')
    rnn_cell = tf.contrib.rnn.LSTMCell(1)
    out, _ = tf.nn.dynamic_rnn(rnn_cell, xin, dtype=tf.float32)
    # updates_collections=None applies the moving-average updates in place,
    # so no separate control dependency on UPDATE_OPS is needed.
    out = tf.contrib.layers.batch_norm(out, is_training=True, updates_collections=None)
    out = tf.identity(out, name='output')

    optimiser = tf.train.AdamOptimizer(.0001)
    out = optimiser.minimize(out, global_step=tf.Variable(0, dtype=tf.float32), name='train_op')

config = tf.ConfigProto(allow_soft_placement=False)
sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())

sample_in = [[[0]]]
sess.run(out, feed_dict={xin: sample_in})
According to the documentation, this may have an adverse effect on performance, and it is not clear to me what I did wrong in the first place. Does my code look correct?
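Another workaround I have been considering (a sketch only, which I have not verified against this repro) is to drop the control-dependency wrapper entirely and fetch the UPDATE_OPS explicitly alongside the training op, so the updates never have to be wired into the gradient frame:

import tensorflow as tf

with tf.device('/cpu:0'):
    xin = tf.placeholder(tf.float32, [None, 1, 1], name='input')
    rnn_cell = tf.contrib.rnn.LSTMCell(1)
    out, _ = tf.nn.dynamic_rnn(rnn_cell, xin, dtype=tf.float32)
    out = tf.layers.batch_normalization(out, training=True)
    out = tf.identity(out, name='output')

    optimiser = tf.train.AdamOptimizer(.0001)
    # No tf.control_dependencies block around minimize().
    train_op = optimiser.minimize(out, name='train_op')

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=False))
sess.run(tf.global_variables_initializer())

# Run the moving-average updates as explicit fetches instead of as control
# dependencies; this avoids adding the enter->frame edges that appear to
# create the cycle in the error above.
sess.run([train_op] + update_ops, feed_dict={xin: [[[0]]]})

Whether this actually avoids the crash on an XLA-enabled build is something I have not tested; it also changes the semantics slightly, since the updates are no longer guaranteed to run before the train op within a single step.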
Note also that the problem only occurs when TensorFlow is built with XLA JIT support, which leads me to believe this may be a bug in TensorFlow.
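To probe the XLA angle, one check I can think of (an assumption on my part that the auto-clustering JIT pass is involved; OFF is already the default, so this may change nothing) is to set the session-level JIT option explicitly and compare the two settings:

import tensorflow as tf

config = tf.ConfigProto(allow_soft_placement=False)
# OFF is the default; ON_1 enables XLA auto-clustering on XLA-enabled builds.
# Comparing runs with the two values may help show whether the JIT pass is
# what introduces the bad enter->frame edge.
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.OFF
sess = tf.Session(config=config)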
Edit: I have also filed an issue on GitHub.