I'm running an old GPU (NVIDIA GeForce 9600M GT 512 MB) with CUDA 4.5 on a MacBook Pro running OS X 10.11.6. (TensorFlow requires CUDA 7.5 or later to use the GPU.)

I get this error when training a Magenta model in TensorFlow:

INFO:tensorflow:Timed-out waiting for a checkpoint.

Here are my command and its output.

$ bazel run //magenta/models/melody_rnn:melody_rnn_train -- --config=attention_rnn --run_dir=/tmp/melody_rnn/logdir/run1 --sequence_example_file=/Users/davidlaxer/magenta/magenta/testdata/notesequences.tfrecord --hparams="batch_size=10,rnn_layer_sizes=[64,64]" --num_training_steps=20000 --eval
INFO: Found 1 target...
Target //magenta/models/melody_rnn:melody_rnn_train up-to-date:
  bazel-bin/magenta/models/melody_rnn/melody_rnn_train
INFO: Elapsed time: 0.561s, Critical Path: 0.09s

INFO: Running command line: bazel-bin/magenta/models/melody_rnn/melody_rnn_train '--config=attention_rnn' '--run_dir=/tmp/melody_rnn/logdir/run1' '--sequence_example_file=/Users/davidlaxer/magenta/magenta/testdata/notesequences.tfrecord' '--hparams=batch_size=10,rnn_layer_sizes=[64,64]' '--num_training_steps=20000' --eval
INFO:tensorflow:hparams = {'rnn_layer_sizes': [64, 64], 'attn_length': 40, 'dropout_keep_prob': 0.5, 'batch_size': 10, 'clip_norm': 3, 'learning_rate': 0.001}
INFO:tensorflow:[<tf.Tensor 'ParseSingleSequenceExample/ParseSingleSequenceExample:0' shape=(?, 74) dtype=float32>, <tf.Tensor 'ParseSingleSequenceExample/ParseSingleSequenceExample:1' shape=(?,) dtype=int64>, <tf.Tensor 'strided_slice:0' shape=() dtype=int32>]
INFO:tensorflow:Train dir: /tmp/melody_rnn/logdir/run1/train
INFO:tensorflow:Eval dir: /tmp/melody_rnn/logdir/run1/eval
INFO:tensorflow:Counting records in /Users/davidlaxer/magenta/magenta/testdata/notesequences.tfrecord.
INFO:tensorflow:Total records: 46
INFO:tensorflow:Waiting for new checkpoint at /tmp/melody_rnn/logdir/run1/train
INFO:tensorflow:Timed-out waiting for a checkpoint.
David-Laxers-MacBook-Pro:magenta davidlaxer$ 

What causes this error?

I also tried adjusting the timeouts:

$ bazel run //magenta/models/melody_rnn:melody_rnn_train -- --config=attention_rnn --run_dir=/tmp/melody_rnn/logdir/run1 --sequence_example_file=/Users/davidlaxer/magenta/magenta/testdata/notesequences.tfrecord --hparams="batch_size=10,rnn_layer_sizes=[64,64]" **--save_summaries_secs=10000 --save_interval_secs=10000** --num_training_steps=20000 --eval

I then removed the --eval flag, and it started training the model:

$ ls -l /tmp/melody_rnn/logdir/run1/train/
total 11032
-rw-r--r--  1 davidlaxer  wheel      149 Jul 20 16:04 checkpoint
-rw-r--r--  1 davidlaxer  wheel  2438765 Jul 20 16:04 events.out.tfevents.1500591842.David-Laxers-MacBook-Pro.local
-rw-r--r--  1 davidlaxer  wheel  1300637 Jul 20 16:04 graph.pbtxt
-rw-r--r--  1 davidlaxer  wheel  1226008 Jul 20 16:04 model.ckpt-0.data-00000-of-00001
-rw-r--r--  1 davidlaxer  wheel     1727 Jul 20 16:04 model.ckpt-0.index
-rw-r--r--  1 davidlaxer  wheel   667410 Jul 20 16:04 model.ckpt-0.meta


$ bazel run //magenta/models/melody_rnn:melody_rnn_train -- --config=attention_rnn --run_dir=/tmp/melody_rnn/logdir/run1 --sequence_example_file=/Users/davidlaxer/magenta/magenta/testdata/notesequences.tfrecord --hparams="batch_size=10,rnn_layer_sizes=[64,64]" **--save_summaries_secs=10000 --save_interval_secs=10000** --num_training_steps=20000 
Killed non-responsive server process (pid=65119)
.
INFO: Found 1 target...
Target //magenta/models/melody_rnn:melody_rnn_train up-to-date:
  bazel-bin/magenta/models/melody_rnn/melody_rnn_train
INFO: Elapsed time: 9.400s, Critical Path: 0.65s

INFO: Running command line: bazel-bin/magenta/models/melody_rnn/melody_rnn_train '--config=attention_rnn' '--run_dir=/tmp/melody_rnn/logdir/run1' '--sequence_example_file=/Users/davidlaxer/magenta/magenta/testdata/notesequences.tfrecord' '--hparams=batch_size=10,rnn_layer_sizes=[64,64]' '**--save_summaries_secs=10000' '--save_interval_secs=10000**' '--num_training_steps=20000'
INFO:tensorflow:hparams = {'rnn_layer_sizes': [64, 64], 'attn_length': 40, 'dropout_keep_prob': 0.5, 'batch_size': 10, 'clip_norm': 3, 'learning_rate': 0.001}
INFO:tensorflow:Counting records in /Users/davidlaxer/magenta/magenta/testdata/notesequences.tfrecord.
INFO:tensorflow:Total records: 46
INFO:tensorflow:[<tf.Tensor 'random_shuffle_queue_Dequeue:0' shape=(?, 74) dtype=float32>, <tf.Tensor 'random_shuffle_queue_Dequeue:1' shape=(?,) dtype=int64>, <tf.Tensor 'random_shuffle_queue_Dequeue:2' shape=() dtype=int32>]
INFO:tensorflow:Train dir: /tmp/melody_rnn/logdir/run1/train
INFO:tensorflow:Starting training loop...
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Error reported to Coordinator: <type 'exceptions.UnicodeDecodeError'>, 'utf8' codec can't decode byte 0xe0 in position 132: invalid continuation byte
INFO:tensorflow:Saving checkpoints for 0 into /tmp/melody_rnn/logdir/run1/train/model.ckpt.
Traceback (most recent call last):
  File "/private/var/tmp/_bazel_davidlaxer/182280691ad889ad33cd20c0640dc2b1/execroot/magenta/bazel-out/local-opt/bin/magenta/models/melody_rnn/melody_rnn_train.runfiles/__main__/magenta/models/melody_rnn/melody_rnn_train.py", line 112, in <module>
    console_entry_point()
  File "/private/var/tmp/_bazel_davidlaxer/182280691ad889ad33cd20c0640dc2b1/execroot/magenta/bazel-out/local-opt/bin/magenta/models/melody_rnn/melody_rnn_train.runfiles/__main__/magenta/models/melody_rnn/melody_rnn_train.py", line 108, in console_entry_point
    tf.app.run(main)
  File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/private/var/tmp/_bazel_davidlaxer/182280691ad889ad33cd20c0640dc2b1/execroot/magenta/bazel-out/local-opt/bin/magenta/models/melody_rnn/melody_rnn_train.runfiles/__main__/magenta/models/melody_rnn/melody_rnn_train.py", line 104, in main
    checkpoints_to_keep=FLAGS.num_checkpoints)
  File "/private/var/tmp/_bazel_davidlaxer/182280691ad889ad33cd20c0640dc2b1/execroot/magenta/bazel-out/local-opt/bin/magenta/models/melody_rnn/melody_rnn_train.runfiles/__main__/magenta/models/shared/events_rnn_train.py", line 71, in run_training
    save_summaries_steps=summary_frequency)
  File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/contrib/training/python/training/training.py", line 530, in train
    loss = session.run(train_op)
  File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 521, in __exit__
    self._close_internal(exception_type)
  File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 556, in _close_internal
    self._sess.close()
  File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 791, in close
    self._sess.close()
  File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 888, in close
    ignore_live_threads=True)
  File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/training/queue_runner_impl.py", line 238, in _run
    enqueue_callable()
  File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1063, in _single_operation_run
    target_list_as_strings, status, None)
  File "/Users/davidlaxer/anaconda/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 465, in raise_exception_on_not_ok_status
    compat.as_text(pywrap_tensorflow.TF_Message(status)),
  File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/util/compat.py", line 84, in as_text
    return bytes_or_text.decode(encoding)
  File "/Users/davidlaxer/anaconda/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 132: invalid continuation byte
ERROR: Non-zero return code '1' from command: Process exited with status 1.

1 Answer

When you specify --eval, you are running evaluation rather than training. The eval job waits for a checkpoint in run_dir and exits if it does not find one.
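To illustrate the wait-then-exit behavior, here is a minimal stdlib-only sketch. It is not Magenta's or TensorFlow's actual code: the function name wait_for_checkpoint and its parameters are hypothetical, and it assumes that a file named checkpoint in the train directory (as in the ls listing in the question) signals that a checkpoint has been written.

```python
import os
import time


def wait_for_checkpoint(train_dir, timeout_secs, poll_secs=1.0):
    """Poll train_dir until a `checkpoint` state file appears.

    Returns the path of the state file, or None if timeout_secs elapse
    first. Hypothetical helper for illustration only.
    """
    state_file = os.path.join(train_dir, "checkpoint")
    deadline = time.time() + timeout_secs
    while True:
        if os.path.exists(state_file):
            return state_file
        if time.time() >= deadline:
            # Nothing was written in time: the caller gives up, which is
            # roughly what the "Timed-out waiting for a checkpoint" log reports.
            return None
        time.sleep(poll_secs)
```

Under these assumptions, calling wait_for_checkpoint("/tmp/melody_rnn/logdir/run1/train", timeout_secs=300) before any training run has saved a checkpoint returns None, while after a training run has written model.ckpt-0 it returns the path of the checkpoint file. In practice this means running the training job (without --eval) first, or alongside the eval job, so there is something for the eval job to pick up.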

Answered 2017-07-20T17:26:59.283