
I have been tracking down a SEGFAULT in TensorFlow. The issue can be reproduced with the following snippet:

import tensorflow as tf

with tf.device('/cpu:0'):
    xin = tf.placeholder(tf.float32, [None, 1, 1], name='input')
    rnn_cell = tf.contrib.rnn.LSTMCell(1)
    out, _ = tf.nn.dynamic_rnn(rnn_cell, xin, dtype=tf.float32)
    out = tf.layers.batch_normalization(out, training=True)
    out = tf.identity(out, name='output')

    optimiser = tf.train.AdamOptimizer(.0001)
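    # tf.layers.batch_normalization registers its moving-average updates
    # in tf.GraphKeys.UPDATE_OPS, so its documentation recommends running
    # them as control dependencies of the train op, as done here.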
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        out = optimiser.minimize(out, global_step=tf.Variable(0, dtype=tf.float32), name='train_op')

config = tf.ConfigProto(allow_soft_placement=False)
sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())

sample_in = [[[0]]]
sess.run(out, feed_dict={xin: sample_in})

I have managed to track the problem down, and I have a pull request up on GitHub. If you run this code with my patch applied, you get the following error message instead:

2018-04-03 13:09:24.326950: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:65:00.0
totalMemory: 11.90GiB freeMemory: 11.74GiB
2018-04-03 13:09:24.326982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-04-03 13:09:24.512956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11366 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:65:00.0, compute capability: 6.1)
Traceback (most recent call last):
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1361, in _do_call
    return fn(*args)
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
    target_list, status, run_metadata)
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Cycle detected when adding enter->frame edge: Edge from gradients/f_count to (null) would create a cycle.
+-> (null)
|   rnn/TensorArrayStack/TensorArrayGatherV3
|   rnn/transpose_1
|   batch_normalization/moments/mean
|   batch_normalization/moments/Squeeze
|   batch_normalization/AssignMovingAvg/sub
|   batch_normalization/AssignMovingAvg/mul
|   batch_normalization/AssignMovingAvg
+-- gradients/f_count


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "breakage.py", line 21, in <module>
    sess.run(out, feed_dict={xin: sample_in})
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Cycle detected when adding enter->frame edge: Edge from gradients/f_count to (null) would create a cycle.
+-> (null)
|   rnn/TensorArrayStack/TensorArrayGatherV3
|   rnn/transpose_1
|   batch_normalization/moments/mean
|   batch_normalization/moments/Squeeze
|   batch_normalization/AssignMovingAvg/sub
|   batch_normalization/AssignMovingAvg/mul
|   batch_normalization/AssignMovingAvg
+-- gradients/f_count

This seems to indicate that there is a topology problem with my sample code. The problem appears to occur whenever I combine any kind of RNN with batch normalization and the additional control dependencies it requires.
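
To see exactly which update operations the control dependency pulls in, the UPDATE_OPS collection can be inspected directly (a quick diagnostic, not part of the repro; the names printed match those in the cycle report above):

# Print the ops that tf.layers.batch_normalization registered for the
# control dependency, e.g. batch_normalization/AssignMovingAvg.
for op in tf.get_collection(tf.GraphKeys.UPDATE_OPS):
    print(op.name)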

A note on batch normalization

I have managed to mitigate the problem by relying on tf.contrib.layers.batch_norm instead and setting its updates_collections argument to None, which inlines the update operations.

For reference, here is the updated code sample:

import tensorflow as tf

with tf.device('/cpu:0'):
    xin = tf.placeholder(tf.float32, [None, 1, 1], name='input')
    rnn_cell = tf.contrib.rnn.LSTMCell(1)
    out, _ = tf.nn.dynamic_rnn(rnn_cell, xin, dtype=tf.float32)
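    # updates_collections=None makes batch_norm apply the moving-average
    # updates in place rather than deferring them to the UPDATE_OPS
    # collection, so no separate control dependency is needed.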
    out = tf.contrib.layers.batch_norm(out, is_training=True, updates_collections=None)
    out = tf.identity(out, name='output')

    optimiser = tf.train.AdamOptimizer(.0001)
    out = optimiser.minimize(out, global_step=tf.Variable(0, dtype=tf.float32), name='train_op')

config = tf.ConfigProto(allow_soft_placement=False)
sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())

sample_in = [[[0]]]
sess.run(out, feed_dict={xin: sample_in})

According to the documentation, this may have an adverse effect on performance, and it is not clear to me what I did wrong in the first place. Does my code look correct?

Note also that the problem only occurs when TensorFlow is built with XLA JIT support, which makes me think this may be a bug in TensorFlow.
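
For reference, on an XLA-enabled build the global JIT level can be toggled through the standard ConfigProto fields. A minimal sketch of that diagnostic (session-level JIT only; in my case the relevant difference was the build flag itself):

import tensorflow as tf

# Toggle XLA global JIT at the session level. This is diagnostic only;
# the flag has no effect on a build compiled without XLA support.
config = tf.ConfigProto(allow_soft_placement=False)
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
sess = tf.Session(config=config)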

Edit: I have also filed an issue on GitHub.
