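I am trying to train a network, and training is interrupted by an error saying that the events file is too large and the events (tf.summary) cannot be flushed. When I check the file, it is indeed over 4 GB: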

    2021-08-08 15:16:44.725004: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at summary_kernels.cc:142 : Out of range: gldv2_training/train_logs/events.out.tfevents.1628396331.XX.9717.627.v2; File too large
    Failed to flush 11 events to gldv2_training/train_logs/events.out.tfevents.1628396331.XX.9717.627.v2
    Could not flush events file.
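
For reference, the size check mentioned above can be scripted. This is only a minimal sketch using the Python standard library: the helper name `event_file_sizes` is made up for illustration, and the log directory is assumed to be the one that appears in the error message.

```python
import os


def event_file_sizes(logdir):
    """Return (filename, size_in_bytes) for each event file in logdir, largest first."""
    sizes = []
    for name in os.listdir(logdir):
        path = os.path.join(logdir, name)
        # TensorBoard event files contain "tfevents" in their file name.
        if os.path.isfile(path) and "tfevents" in name:
            sizes.append((name, os.path.getsize(path)))
    return sorted(sizes, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Assumed path, taken from the error message above.
    logdir = "gldv2_training/train_logs"
    if os.path.isdir(logdir):
        for name, size in event_file_sizes(logdir):
            print(f"{name}: {size / 2**30:.2f} GiB")
```

Running it against the training log directory lists each events file with its size, which makes it easy to see which file has grown past 4 GB.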

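I am using the DELF implementation (the training script is here: https://github.com/tensorflow/models/blob/master/research/delf/delf/python/training/train.py). I ran into the same problem before with an object detection model. As far as I remember, I simply avoided it by not writing events at all. This time I would like to find the cause of the problem and a proper solution.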

I found questions about missing event files, but no good answer to my problem. I would appreciate your help with this.

Update: what I have tried so far

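  • Changing the max_queue parameter. I noticed that tf.summary.create_file_writer has a max_queue parameter with a default value of 10, so I assumed the error occurred because the queue exceeded that default and the 11 events could not be flushed. I tried other values (20, 200) and got the same error, now reporting that 21 events, 201 events, and so on could not be flushed.
  • Changing flush_millis. I lowered the value so that events are flushed more often.
  • Calling summary_writer.flush(). The code has no explicit flush() call (probably because the writer is supposed to flush every flush_millis milliseconds), but I added one to see whether it helps.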

Unfortunately, none of these worked.
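
For concreteness, the three attempts correspond roughly to the sketch below. It is not the DELF training code: the function name `write_scalars`, the tag `loss`, the temporary log directory, and the parameter values are placeholders chosen for illustration. Only `tf.summary.create_file_writer` (with its `max_queue` and `flush_millis` arguments), `tf.summary.scalar`, and the writer's `flush()` come from the TensorFlow 2 summary API.

```python
import tempfile

import tensorflow as tf


def write_scalars(logdir, num_steps=5, max_queue=200, flush_millis=1000):
    """Write a few scalar summaries using non-default writer settings."""
    # Attempts 1 and 2: larger queue and a shorter flush interval.
    writer = tf.summary.create_file_writer(
        logdir, max_queue=max_queue, flush_millis=flush_millis
    )
    with writer.as_default():
        for step in range(num_steps):
            # Placeholder metric; the real script logs losses and images.
            tf.summary.scalar("loss", 1.0 / (step + 1), step=step)
        # Attempt 3: force a flush instead of waiting for flush_millis.
        writer.flush()
    writer.close()
    return logdir


if __name__ == "__main__":
    # Temporary directory so the sketch does not touch real training logs.
    print(write_scalars(tempfile.mkdtemp()))
```

With the real training run, each of these settings only changed the number of events reported in the "Failed to flush N events" message; the "File too large" error itself stayed the same.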

Full stack trace of the initial error:

    2021-08-08 15:16:44.725004: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at summary_kernels.cc:142 : Out of range: gldv2_training/train_logs/events.out.tfevents.1628396331.XX.9717.627.v2; File too large
    Failed to flush 11 events to gldv2_training/train_logs/events.out.tfevents.1628396331.XX.9717.627.v2
    Could not flush events file.
    Traceback (most recent call last):
      File "train.py", line 486, in <module>
        app.run(main)
      File "/home/user/.local/lib/python3.8/site-packages/absl/app.py", line 312, in run
        _run_main(main, args)
      File "/home/user/.local/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
        sys.exit(main(argv))
      File "train.py", line 410, in main
        desc_dist_loss, attn_dist_loss = distributed_train_step(input_batch)
      File "/home/user/.local/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
        result = self._call(*args, **kwds)
      File "/home/user/.local/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 917, in _call
        return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
      File "/home/user/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3023, in __call__
        return graph_function._call_flat(
      File "/home/user/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1960, in _call_flat
        return self._build_call_outputs(self._inference_function.call(
      File "/home/user/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 591, in call
        outputs = execute.execute(
      File "/home/user/.local/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
        tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
    tensorflow.python.framework.errors_impl.OutOfRangeError: 2 root error(s) found.
      (0) Out of range:  gldv2_training/train_logs/events.out.tfevents.1628396331.XX.9717.627.v2; File too large
        Failed to flush 11 events to gldv2_training/train_logs/events.out.tfevents.1628396331.XX.9717.627.v2
        Could not flush events file.
         [[node attention/percent_25/write_summary (defined at /home/user/.local/lib/python3.8/site-packages/tensorboard/plugins/scalar/summary_v2.py:89) ]]
      (1) Out of range:  gldv2_training/train_logs/events.out.tfevents.1628396331.XX.9717.627.v2; File too large
        Failed to flush 11 events to gldv2_training/train_logs/events.out.tfevents.1628396331.XX.9717.627.v2
        Could not flush events file.
         [[node attention/percent_25/write_summary (defined at /home/user/.local/lib/python3.8/site-packages/tensorboard/plugins/scalar/summary_v2.py:89) ]]
         [[GroupCrossDeviceControlEdges_0/Identity_2/_107]]
    0 successful operations.
    0 derived errors ignored. [Op:__inference_distributed_train_step_12517]

    Errors may have originated from an input operation.
    Input Source operations connected to node attention/percent_25/write_summary:
     batch_images/write_summary/writer (defined at /home/user/.local/lib/python3.8/site-packages/tensorboard/plugins/image/summary_v2.py:140)

    Input Source operations connected to node attention/percent_25/write_summary:
     batch_images/write_summary/writer (defined at /home/user/.local/lib/python3.8/site-packages/tensorboard/plugins/image/summary_v2.py:140)

    Function call stack:
    distributed_train_step -> distributed_train_step