I am trying to reproduce the Mask R-CNN training from the following repository: https://github.com/maxkferg/metal-defect-detection
The training code snippet is as follows:
# Training - Stage 1
print("Training network heads")
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=40,
            layers='heads')

# Training - Stage 2
# Finetune layers from ResNet stage 4 and up
print("Fine tune Resnet stage 4 and up")
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=120,
            layers='4+')

# Training - Stage 3
# Fine tune all layers
print("Fine tune all layers")
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE / 10,
            epochs=160,
            layers='all')
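For context on why stage 2 is heavier than stage 1, here is a minimal sketch of how the `layers` argument selects trainable layers. The regular expressions are an assumption modeled on the upstream Matterport Mask R-CNN `train()` method; the copy vendored in this repository may differ. The layer names in `EXAMPLE_LAYERS` are hypothetical and only serve as illustration.

```python
import re

# Assumed patterns, modeled on upstream Matterport Mask R-CNN;
# the version used in this repository may define them differently.
LAYER_REGEX = {
    "heads": r"(mrcnn_.*)|(rpn_.*)|(fpn_.*)",
    "4+": r"(res4.*)|(bn4.*)|(res5.*)|(bn5.*)|(mrcnn_.*)|(rpn_.*)|(fpn_.*)",
    "all": r".*",
}

# Hypothetical layer names, for illustration only.
EXAMPLE_LAYERS = [
    "conv1",
    "res2a_branch2a",
    "res4a_branch2a",
    "bn5c_branch2c",
    "fpn_c5p5",
    "rpn_class_raw",
    "mrcnn_mask",
]


def trainable_layers(stage, names=EXAMPLE_LAYERS):
    """Return the layer names whose full name matches the stage's pattern."""
    pattern = LAYER_REGEX[stage]
    return [name for name in names if re.fullmatch(pattern, name)]


for stage in ("heads", "4+", "all"):
    selected = trainable_layers(stage)
    print(f"{stage}: {len(selected)} trainable -> {selected}")
```

Under these assumptions, `'4+'` unfreezes the ResNet stage 4 and 5 layers in addition to the heads, so gradients have to be computed and stored for a much larger part of the network than in stage 1.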
Stage 1 runs fine, but training fails as soon as stage 2 starts, with the following output:
2020-08-17 15:53:10.685456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 123 Chunks of size 2048 totalling 246.0KiB
2020-08-17 15:53:10.685456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 2816 totalling 2.8KiB
2020-08-17 15:53:10.686456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 6 Chunks of size 3072 totalling 18.0KiB
2020-08-17 15:53:10.686456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 387 Chunks of size 4096 totalling 1.51MiB
2020-08-17 15:53:10.687456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 6144 totalling 6.0KiB
2020-08-17 15:53:10.687456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 6656 totalling 6.5KiB
2020-08-17 15:53:10.688456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 60 Chunks of size 8192 totalling 480.0KiB
2020-08-17 15:53:10.688456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 2 Chunks of size 9216 totalling 18.0KiB
2020-08-17 15:53:10.689456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 12 Chunks of size 12288 totalling 144.0KiB
2020-08-17 15:53:10.689456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 2 Chunks of size 16384 totalling 32.0KiB
2020-08-17 15:53:10.690456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 21248 totalling 20.8KiB
2020-08-17 15:53:10.691456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 24064 totalling 23.5KiB
2020-08-17 15:53:10.691456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 5 Chunks of size 24576 totalling 120.0KiB
2020-08-17 15:53:10.692456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 37632 totalling 36.8KiB
2020-08-17 15:53:10.692456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 40960 totalling 40.0KiB
2020-08-17 15:53:10.693456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 4 Chunks of size 49152 totalling 192.0KiB
2020-08-17 15:53:10.693456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 6 Chunks of size 65536 totalling 384.0KiB
2020-08-17 15:53:10.694456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 81920 totalling 80.0KiB
2020-08-17 15:53:10.695456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 90624 totalling 88.5KiB
2020-08-17 15:53:10.695456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 131072 totalling 128.0KiB
2020-08-17 15:53:10.695456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 3 Chunks of size 147456 totalling 432.0KiB
2020-08-17 15:53:10.696456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 12 Chunks of size 262144 totalling 3.00MiB
2020-08-17 15:53:10.696456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 327680 totalling 320.0KiB
2020-08-17 15:53:10.697457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 11 Chunks of size 524288 totalling 5.50MiB
2020-08-17 15:53:10.697457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 4 Chunks of size 589824 totalling 2.25MiB
2020-08-17 15:53:10.698457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 194 Chunks of size 1048576 totalling 194.00MiB
2020-08-17 15:53:10.699457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 17 Chunks of size 2097152 totalling 34.00MiB
2020-08-17 15:53:10.699457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 2211840 totalling 2.11MiB
2020-08-17 15:53:10.700457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 146 Chunks of size 2359296 totalling 328.50MiB
2020-08-17 15:53:10.701457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 2360320 totalling 2.25MiB
2020-08-17 15:53:10.701457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 2621440 totalling 2.50MiB
2020-08-17 15:53:10.702457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 2698496 totalling 2.57MiB
2020-08-17 15:53:10.702457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 3670016 totalling 3.50MiB
2020-08-17 15:53:10.703457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 31 Chunks of size 4194304 totalling 124.00MiB
2020-08-17 15:53:10.703457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 6 Chunks of size 4718592 totalling 27.00MiB
2020-08-17 15:53:10.704457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 5 Chunks of size 8388608 totalling 40.00MiB
2020-08-17 15:53:10.705457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 25 Chunks of size 9437184 totalling 225.00MiB
2020-08-17 15:53:10.705457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 2 Chunks of size 9438208 totalling 18.00MiB
2020-08-17 15:53:10.706457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 9441280 totalling 9.00MiB
2020-08-17 15:53:10.706457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 16138752 totalling 15.39MiB
2020-08-17 15:53:10.707457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 18874368 totalling 18.00MiB
2020-08-17 15:53:10.707457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 37748736 totalling 36.00MiB
2020-08-17 15:53:10.708457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 7 Chunks of size 51380224 totalling 343.00MiB
2020-08-17 15:53:10.708457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:684] Sum Total of in-use chunks: 1.41GiB
2020-08-17 15:53:10.709457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:686] Stats:
Limit: 1613615104
InUse: 1510723072
MaxInUse: 1510723072
NumAllocs: 3860
MaxAllocSize: 119947776
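The closing allocator stats can be read as simple arithmetic. The sketch below parses the numbers exactly as they appear in the log (with `MaxAllocSize` taken as 119947776, the last clean occurrence) and compares what is in use against the allocator limit. The helper name `parse_stats` is my own, and the excerpt does not include the actual error line, so this only shows how little headroom was left, not what triggered the failure.

```python
import re

# Stats line copied from the log above (values in bytes).
STATS = (
    "Limit: 1613615104 InUse: 1510723072 MaxInUse: 1510723072 "
    "NumAllocs: 3860 MaxAllocSize: 119947776"
)

MIB = 1024 ** 2
GIB = 1024 ** 3


def parse_stats(text):
    """Turn 'Key: value' pairs from the allocator stats into a dict of ints."""
    return {key: int(value) for key, value in re.findall(r"(\w+):\s*(\d+)", text)}


stats = parse_stats(STATS)
free_bytes = stats["Limit"] - stats["InUse"]

print(f"limit:        {stats['Limit'] / GIB:.2f} GiB")
print(f"in use:       {stats['InUse'] / GIB:.2f} GiB")
print(f"used share:   {stats['InUse'] / stats['Limit']:.1%}")
print(f"free:         {free_bytes / MIB:.1f} MiB")
print(f"largest alloc so far: {stats['MaxAllocSize'] / MIB:.1f} MiB")
```

The allocator limit is about 1.5 GiB rather than the full 2 GB of the card, and the remaining free space is smaller than the largest single allocation already made during the run.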
Training runs on a Quadro K420 with 2 GB of RAM. Is this simply an out-of-memory problem, or am I missing something? And is there any way to train on my device?