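I am trying to train a network, and training aborts with an error saying that the file is too large and that the events (tf.summary) cannot be flushed. When I checked, the events file was indeed larger than 4 GB.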
2021-08-08 15:16:44.725004: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at summary_kernels.cc:142 : Out of range: gldv2_training/train_logs/events.out.tfevents.1628396331.XX.9717.627.v2; File too large
Failed to flush 11 events to gldv2_training/train_logs/events.out.tfevents.1628396331.XX.9717.627.v2
Could not flush events file.
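The size check mentioned above can be scripted so that the growth of the event files is visible while training runs. This is only a minimal sketch using the standard library; the helper names `event_file_sizes` and `format_size` are made up for illustration, and the directory is the one that appears in the log.

```python
import os


def event_file_sizes(logdir):
    """Return (path, size_in_bytes) for every event file under logdir, largest first."""
    sizes = []
    for root, _dirs, files in os.walk(logdir):
        for name in files:
            # TensorBoard event files contain "tfevents" in their file name.
            if "tfevents" in name:
                path = os.path.join(root, name)
                sizes.append((path, os.path.getsize(path)))
    return sorted(sizes, key=lambda item: item[1], reverse=True)


def format_size(num_bytes):
    """Format a byte count in GiB with two decimals."""
    return f"{num_bytes / 1024**3:.2f} GiB"


if __name__ == "__main__":
    # Log directory taken from the error message; adjust as needed.
    for path, size in event_file_sizes("gldv2_training/train_logs"):
        print(f"{format_size(size)}  {path}")
```

Running it periodically during training shows how quickly the file approaches the size at which the error appears.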
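I am using the DELF implementation (the training script is here: https://github.com/tensorflow/models/blob/master/research/delf/delf/python/training/train.py). I ran into the same problem earlier with an object detection model. As far as I remember, I simply avoided it by leaving out the event writing. This time I would like to find the cause of the problem and a proper solution.
I found questions about missing event files, but could not find a good answer to my problem. I would appreciate your help with this.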
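Update: what I have tried so far
- Changing the max_queue parameter - I noticed that tf.summary.create_file_writer has a max_queue parameter with a default value of 10. I assumed the error occurred because this default was exceeded, which is why 11 events could not be flushed. I tried other values, 20 and 200, and ended up with errors saying that 21 events, 201 events, and so on could not be flushed.
- Changing flush_millis - I lowered it so that events are flushed more often.
- summary_writer.flush() - The code has no flush() call (probably because events are supposed to be flushed every flush_millis milliseconds), but I tried adding flush() to see whether it helps.
Unfortunately, none of these worked.
Full error stack for the initial error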
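For reference, the three attempts listed above correspond roughly to the sketch below. It assumes TensorFlow 2.x in eager mode; the temporary log directory, the "loss" tag, and the chosen values are placeholders rather than what train.py actually uses. All three settings only affect when buffered events are written to disk, so they would not be expected to change how large the events file eventually becomes.

```python
import tempfile

import tensorflow as tf

# Placeholder log directory; the question writes to gldv2_training/train_logs.
logdir = tempfile.mkdtemp()

# max_queue: number of pending events buffered before a flush is forced (default 10).
# flush_millis: longest interval between flushes in milliseconds (default 120000).
writer = tf.summary.create_file_writer(logdir, max_queue=20, flush_millis=1000)

with writer.as_default():
    for step in range(100):
        # Placeholder scalar; the real script also writes image summaries.
        tf.summary.scalar("loss", 1.0 / (step + 1), step=step)
    # Explicit flush, in addition to the queue-based and time-based flushing.
    writer.flush()

writer.close()
```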
2021-08-08 15:16:44.725004: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at summary_kernels.cc:142 : Out of range: gldv2_training/train_logs/events.out.tfevents.1628396331.XX.9717.627.v2; File too large
Failed to flush 11 events to gldv2_training/train_logs/events.out.tfevents.1628396331.XX.9717.627.v2
Could not flush events file.
Traceback (most recent call last):
File "train.py", line 486, in <module>
app.run(main)
File "/home/user/.local/lib/python3.8/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/home/user/.local/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "train.py", line 410, in main
desc_dist_loss, attn_dist_loss = distributed_train_step(input_batch)
File "/home/user/.local/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
result = self._call(*args, **kwds)
File "/home/user/.local/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 917, in _call
return self._stateless_fn(*args, **kwds) # pylint: disable=not-callable
File "/home/user/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3023, in __call__
return graph_function._call_flat(
File "/home/user/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1960, in _call_flat
return self._build_call_outputs(self._inference_function.call(
File "/home/user/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 591, in call
outputs = execute.execute(
File "/home/user/.local/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.OutOfRangeError: 2 root error(s) found.
(0) Out of range: gldv2_training/train_logs/events.out.tfevents.1628396331.XX.9717.627.v2; File too large
Failed to flush 11 events to gldv2_training/train_logs/events.out.tfevents.1628396331.XX.9717.627.v2
Could not flush events file.
[[node attention/percent_25/write_summary (defined at /home/user/.local/lib/python3.8/site-packages/tensorboard/plugins/scalar/summary_v2.py:89) ]]
(1) Out of range: gldv2_training/train_logs/events.out.tfevents.1628396331.XX.9717.627.v2; File too large
Failed to flush 11 events to gldv2_training/train_logs/events.out.tfevents.1628396331.XX.9717.627.v2
Could not flush events file.
[[node attention/percent_25/write_summary (defined at /home/user/.local/lib/python3.8/site-packages/tensorboard/plugins/scalar/summary_v2.py:89) ]]
[[GroupCrossDeviceControlEdges_0/Identity_2/_107]]
0 successful operations.
0 derived errors ignored. [Op:__inference_distributed_train_step_12517]
Errors may have originated from an input operation.
Input Source operations connected to node attention/percent_25/write_summary:
batch_images/write_summary/writer (defined at /home/user/.local/lib/python3.8/site-packages/tensorboard/plugins/image/summary_v2.py:140)
Input Source operations connected to node attention/percent_25/write_summary:
batch_images/write_summary/writer (defined at /home/user/.local/lib/python3.8/site-packages/tensorboard/plugins/image/summary_v2.py:140)
Function call stack:
distributed_train_step -> distributed_train_step