I'm trying to train a custom model with Magenta, which runs on tensorflow-gpu. The problem is that no matter what I do, TensorFlow fails to allocate my GPU memory properly and start training. For the record, this is the command I'm using:
t2t_trainer --data_dir="{folder}" --hparams="label_smoothing=0.0, max_length=0,max_target_seq_length=4096" --hparams_set=score2perf_transformer_base --model=transformer --output_dir="{folder}" --problem=score2perf_maestro_language_uncropped_aug --train_steps=2500
With the sequence length set to 2048 this works fine and only uses about 25% of CPU and GPU power. I have an i7-9600k and an RTX 2070 with 8 GB of VRAM. However, when I increase it to 4096, it starts failing even on the smallest GPU allocations. Here is a (condensed) version of the log:
2019-11-14 14:38:14.028064: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 7.60G (8160437760 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-14 14:38:14.028311: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 6.84G (7344393728 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
WARNING:tensorflow:From c:\python\lib\site-packages\tensorflow_core\python\training\saver.py:1069: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
W1114 14:38:14.551839 9104 deprecation.py:323] From c:\python\lib\site-packages\tensorflow_core\python\training\saver.py:1069: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
INFO:tensorflow:Running local_init_op.
I1114 14:38:14.811158 9104 session_manager.py:500] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I1114 14:38:14.944813 9104 session_manager.py:502] Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into C:\Users\conspiracy2\Documents\music comp\out\11-14 set 2.0\checkpts\model.ckpt.
I1114 14:38:17.920329 9104 basic_session_run_hooks.py:606] Saving checkpoints for 0 into C:\Users\conspiracy2\Documents\music comp\out\11-14 set 2.0\checkpts\model.ckpt.
2019-11-14 14:38:21.598678: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2019-11-14 14:38:22.574418: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 1.44G (1550483456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-14 14:38:22.574642: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 1.44G (1550483456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-14 14:38:32.575117: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 1.44G (1550483456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-14 14:38:32.575322: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 1.44G (1550483456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-14 14:38:32.575478: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 576.00MiB (rounded to 603979776). Current allocation summary follows.
2019-11-14 14:38:32.575683: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (256): Total Chunks: 75, Chunks in use: 69. 18.8KiB allocated for chunks. 17.3KiB in use in bin. 304B client-requested in use in bin.
2019-11-14 14:38:32.575871: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (512): Total Chunks: 1, Chunks in use: 0. 512B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-11-14 14:38:32.576033: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (1024): Total Chunks: 2, Chunks in use: 2. 2.3KiB allocated for chunks. 2.3KiB in use in bin. 2.0KiB client-requested in use in bin.
2019-11-14 14:38:32.576206: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (2048): Total Chunks: 96, Chunks in use: 96. 192.0KiB allocated for chunks. 192.0KiB in use in bin. 192.0KiB client-requested in use in bin.
2019-11-14 14:38:32.576406: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (4096): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-11-14 14:38:32.576604: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (8192): Total Chunks: 19, Chunks in use: 18. 158.8KiB allocated for chunks. 144.0KiB in use in bin. 144.0KiB client-requested in use in bin.
2019-11-14 14:38:32.576926: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (16384): Total Chunks: 13, Chunks in use: 12. 208.0KiB allocated for chunks. 192.0KiB in use in bin. 192.0KiB client-requested in use in bin.
2019-11-14 14:38:32.577128: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (32768): Total Chunks: 48, Chunks in use: 48. 1.82MiB allocated for chunks. 1.82MiB in use in bin. 1.82MiB client-requested in use in bin.
2019-11-14 14:38:32.577355: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (65536): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-11-14 14:38:32.577566: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (131072): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-11-14 14:38:32.577770: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (262144): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-11-14 14:38:32.577973: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (524288): Total Chunks: 1, Chunks in use: 1. 620.0KiB allocated for chunks. 620.0KiB in use in bin. 620.0KiB client-requested in use in bin.
2019-11-14 14:38:32.578238: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (1048576): Total Chunks: 72, Chunks in use: 72. 72.00MiB allocated for chunks. 72.00MiB in use in bin. 72.00MiB client-requested in use in bin.
2019-11-14 14:38:32.578395: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (2097152): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-11-14 14:38:32.578561: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (4194304): Total Chunks: 37, Chunks in use: 36. 151.84MiB allocated for chunks. 144.00MiB in use in bin. 144.00MiB client-requested in use in bin.
2019-11-14 14:38:32.578834: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (8388608): Total Chunks: 38, Chunks in use: 37. 304.00MiB allocated for chunks. 296.00MiB in use in bin. 296.00MiB client-requested in use in bin.
2019-11-14 14:38:32.579017: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (16777216): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-11-14 14:38:32.579203: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (33554432): Total Chunks: 6, Chunks in use: 6. 192.00MiB allocated for chunks. 192.00MiB in use in bin. 192.00MiB client-requested in use in bin.
2019-11-14 14:38:32.579489: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (67108864): Total Chunks: 2, Chunks in use: 1. 160.00MiB allocated for chunks. 64.00MiB in use in bin. 64.00MiB client-requested in use in bin.
2019-11-14 14:38:32.579704: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (134217728): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2019-11-14 14:38:32.579998: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (268435456): Total Chunks: 11, Chunks in use: 10. 5.29GiB allocated for chunks. 5.00GiB in use in bin. 5.00GiB client-requested in use in bin.
2019-11-14 14:38:32.580279: I tensorflow/core/common_runtime/bfc_allocator.cc:885] Bin for 576.00MiB was 256.00MiB, Chunk State:
2019-11-14 14:38:32.580407: I tensorflow/core/common_runtime/bfc_allocator.cc:891] Size: 300.92MiB | Requested Size: 0B | in_use: 0 | bin_num: 20, prev: Size: 512.00MiB | Requested Size: 512.00MiB |
2019-11-14 14:38:32.643932: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 5.75GiB
2019-11-14 14:38:32.644132: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 6609954304 memory_limit_: 8160437862 available bytes: 1550483558 curr_region_allocation_bytes_: 16320876032
2019-11-14 14:38:32.644377: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats:
Limit: 8160437862
InUse: 6177115648
MaxInUse: 6185504256
NumAllocs: 611
MaxAllocSize: 603979776
2019-11-14 14:38:32.644686: W tensorflow/core/common_runtime/bfc_allocator.cc:424] *************************************_**********************************************************____
2019-11-14 14:38:32.644868: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at pad_op.cc:122 : Resource exhausted: OOM when allocating tensor with shape[16777216,9] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "c:\python\lib\site-packages\tensorflow_core\python\client\session.py", line 1365, in _do_call
return fn(*args)
File "c:\python\lib\site-packages\tensorflow_core\python\client\session.py", line 1350, in _run_fn
target_list, run_metadata)
File "c:\python\lib\site-packages\tensorflow_core\python\client\session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[16777216,9] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node transformer/parallel_0_4/transformer/transformer/body/decoder/layer_2/self_attention/multihead_attention/dot_product_attention/Pad}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
I've attached a pastebin of the "full" relevant log here: https://pastebin.com/CQpYdUC4
To get the obvious out of the way: no, I'm not running any other programs that use the GPU, and no, I'm not running multiple instances. It can't even allocate 512 MB of GPU memory, even though up to ~8 GB should be free.
I've tried manually lowering memory_fraction to 0.2 in the t2t_trainer.py script, and I've tried setting "allow_growth". Neither seems to help, although setting memory_fraction to 0.2 did reduce the reported available memory, and it then started by trying to allocate 1.44 GB instead of 7 GB.
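For reference, this is roughly the kind of change I made (a minimal sketch of the standard TF 1.x GPU session options, not the literal diff in t2t_trainer.py; 0.2 is just the fraction I tried):

import tensorflow as tf  # TF 1.14

# Cap the per-process VRAM reservation and let it grow on demand instead of
# grabbing (almost) the whole 8 GB up front.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.2,
                            allow_growth=True)
session_config = tf.ConfigProto(gpu_options=gpu_options)
# ...this config then has to reach whatever creates the training session
# (in tensor2tensor it ends up inside the Estimator's RunConfig).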
I'm out of ideas at this point. For the record, this is TensorFlow 1.14 with CUDA 10.0, since the model requires them.
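And just to confirm the install itself does see the GPU (the log above already shows cublas being loaded), a minimal TF 1.x check:

import tensorflow as tf

print(tf.__version__)              # 1.14.x
print(tf.test.is_gpu_available())  # True here; the RTX 2070 is detected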