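I am trying to use Keras to run a training session on an Amazon EC2 free-tier cloud instance (1 GB RAM). The whole dataset is very small (1000 images of 80 x 80, 20 MB in total), but the process is killed after 2 epochs of running model.fit() (the number varies; sometimes it gets as far as 15). I am trying to disable the OOM killer or find some workaround. Any suggestions? Below you will find the memory trace (it does not show any serious numbers, so I wonder why the script is being killed in the first place).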
Error (reproducible on a 1 GB RAM instance):
64/870 [=>............................] - ETA: 12s - loss: 0.4477 - accuracy: 0.8750Traceback (most recent call last):
File "image_classifier.py", line 990, in <module>
clf.predict_folder_k_cnn(folder_path='test_photos_2/', label='One', epochs=50)
File "image_classifier.py", line 951, in predict_folder_k_cnn
model.fit(self.x_train, self.y_train, epochs=epochs, batch_size=batch_size, **(model_fit_args or {}))
File "/home/ec2-user/.local/lib/python3.7/site-packages/keras/engine/training.py", line 1239, in fit
validation_freq=validation_freq)
File "/home/ec2-user/.local/lib/python3.7/site-packages/keras/engine/training_arrays.py", line 196, in fit_loop
outs = fit_function(ins_batch)
File "/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/keras/backend.py", line 3510, in __call__
outputs = self._graph_fn(*converted_inputs)
File "/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 572, in __call__
return self._call_flat(args)
File "/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 671, in _call_flat
outputs = self._inference_function.call(ctx, args)
File "/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 445, in call
ctx=ctx)
File "/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[64,80,80,32] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
[[node gradients/max_pool/MaxPool_grad/MaxPoolGrad (defined at /home/ec2-user/.local/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:3009) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[Op:__inference_keras_scratch_graph_1638]
Function call stack:
keras_scratch_graph
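For a sense of scale, the tensor named in the error message can be sized by hand. The sketch below assumes that "type float" means 4-byte float32; `tensor_mib` is a hypothetical helper introduced only for this illustration, not part of Keras or TensorFlow.

```python
from functools import reduce
from operator import mul

FLOAT32_BYTES = 4  # assumption: "type float" in the error means float32


def tensor_mib(shape, bytes_per_element=FLOAT32_BYTES):
    """Return the size in MiB of one dense tensor with the given shape."""
    elements = reduce(mul, shape, 1)
    return elements * bytes_per_element / 2**20


# Shape taken from the ResourceExhaustedError: [batch, height, width, channels]
failed = tensor_mib((64, 80, 80, 32))
print(f"batch 64: {failed:.1f} MiB per tensor")  # batch 64: 50.0 MiB per tensor

# The same tensor at smaller batch sizes
for batch_size in (32, 16, 8):
    size = tensor_mib((batch_size, 80, 80, 32))
    print(f"batch {batch_size}: {size:.1f} MiB per tensor")
```

A single buffer of this shape is 50 MiB at a batch size of 64. The failing node is a gradient op (MaxPoolGrad), so training presumably holds several buffers of this size at once. Since the first dimension is the batch size, passing a smaller `batch_size` to model.fit() would shrink each of them proportionally.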
dmesg output:
t:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:16kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 504.825883] lowmem_reserve[]: 0 932 932 932
[ 504.829525] Node 0 DMA32 free:44316kB min:44316kB low:55392kB high:66468kB active_anon:892184kB inactive_anon:256kB active_file:24kB inactive_file:0kB unevictable:0kB writepending:0kB present:1032192kB managed:991368kB mlocked:0kB kernel_stack:1952kB pagetables:7124kB bounce:0kB free_pcp:120kB local_pcp:120kB free_cma:0kB
[ 504.851094] lowmem_reserve[]: 0 0 0 0
[ 504.854427] Node 0 DMA: 10*4kB (UME) 11*8kB (UME) 13*16kB (UME) 15*32kB (UE) 9*64kB (UE) 8*128kB (UME) 6*256kB (UME) 1*512kB (E) 0*1024kB 0*2048kB 0*4096kB = 4464kB
[ 504.865932] Node 0 DMA32: 1101*4kB (UE) 781*8kB (UE) 458*16kB (UE) 317*32kB (UE) 121*64kB (UME) 46*128kB (UME) 6*256kB (U) 0*512kB 1*1024kB (M) 0*2048kB 0*4096kB = 44316kB
[ 504.877626] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 504.884964] 103 total pagecache pages
[ 504.888296] 0 pages in swap cache
[ 504.891399] Swap cache stats: add 0, delete 0, find 0/0
[ 504.895970] Free swap = 0kB
[ 504.898881] Total swap = 0kB
[ 504.901907] 262045 pages RAM
[ 504.904737] 0 pages HighMem/MovableOnly
[ 504.908299] 10227 pages reserved
[ 504.911383] 0 pages hwpoisoned
[ 504.914445] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[ 504.921393] [ 1931] 0 1931 10278 97 28 3 0 0 systemd-journal
[ 504.928934] [ 1961] 0 1961 29191 67 28 4 0 0 lvmetad
[ 504.936328] [ 2655] 0 2655 16041 149 30 3 0 -1000 auditd
[ 504.943150] [ 2683] 81 2683 15123 118 35 3 0 -900 dbus-daemon
[ 504.950385] [ 2686] 32 2686 18423 178 38 3 0 0 rpcbind
[ 504.957604] [ 2690] 999 2690 3152 41 12 3 0 0 lsmd
[ 504.964760] [ 2691] 0 2691 3274 28 12 3 0 0 rngd
[ 504.972138] [ 2693] 0 2693 7117 89 19 3 0 0 systemd-logind
[ 504.979632] [ 2700] 997 2700 30649 135 33 3 0 0 chronyd
[ 504.987111] [ 2716] 0 2716 24457 163 35 3 0 0 gssproxy
[ 504.994331] [ 2920] 0 2920 25156 514 48 3 0 0 dhclient
[ 505.001383] [ 2961] 0 2961 25156 510 48 3 0 0 dhclient
[ 505.008709] [ 3105] 0 3105 22545 262 44 3 0 0 master
[ 505.015992] [ 3109] 89 3109 22567 253 44 3 0 0 pickup
[ 505.022854] [ 3110] 89 3110 22586 256 46 3 0 0 qmgr
[ 505.029730] [ 3157] 0 3157 117174 442 30 6 0 0 amazon-ssm-agen
[ 505.037492] [ 3159] 0 3159 54140 270 41 3 0 0 rsyslogd
[ 505.044641] [ 3199] 0 3199 30322 32 12 3 0 0 agetty
[ 505.051767] [ 3200] 0 3200 2634 33 11 3 0 0 agetty
[ 505.059124] [ 3333] 0 3333 38138 334 76 3 0 0 sshd
[ 505.066299] [ 3371] 0 3371 1065 26 8 3 0 0 acpid
[ 505.073401] [ 3414] 1000 3414 38175 390 73 3 0 0 sshd
[ 505.082220] [ 3415] 1000 3415 31219 269 16 3 0 0 bash
[ 505.089459] [ 3564] 0 3564 11355 132 24 3 0 -1000 systemd-udevd
[ 505.097212] [ 4261] 0 4261 28182 254 59 4 0 -1000 sshd
[ 505.103965] [ 4396] 0 4396 33767 158 21 4 0 0 crond
[ 505.110852] [ 4421] 0 4421 6968 50 19 3 0 0 atd
[ 505.118310] [22988] 1000 22988 33586 64 21 3 0 0 screen
[ 505.125710] [22989] 1000 22989 33621 128 19 3 0 0 screen
[ 505.132826] [22990] 1000 22990 31215 270 16 3 0 0 bash
[ 505.140153] [23011] 1000 23011 568549 219738 812 5 0 0 python3
[ 505.147922] Out of memory: Kill process 23011 (python3) score 875 or sacrifice child
[ 505.154309] Killed process 23011 (python3) total-vm:2274196kB, anon-rss:878952kB, file-rss:0kB, shmem-rss:0kB
[ 505.195909] oom_reaper: reaped process 23011 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
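The rss column in the process table above is counted in pages, not kilobytes. A minimal sketch of the conversion, assuming the usual 4 kB page size (an assumption; the log does not state it directly, although the numbers are consistent with it):

```python
PAGE_KB = 4  # assumption: 4 kB pages


def pages_to_kb(pages, page_kb=PAGE_KB):
    """Convert a page count from the OOM-killer table to kilobytes."""
    return pages * page_kb


python3_rss_kb = pages_to_kb(219738)  # rss column for pid 23011 (python3)
total_ram_kb = pages_to_kb(262045)    # "262045 pages RAM"

print(python3_rss_kb)  # 878952, the anon-rss value in the "Killed process" line
print(f"share of RAM held by python3: {python3_rss_kb / total_ram_kb:.1%}")
```

Under that assumption the python3 process alone was resident at roughly 879 MB when it was killed, and the log also reports `Total swap = 0kB`, so the kernel had nowhere to page memory out to.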
[ec2-user@ip-172-31-95-14 ~]$ python3 image_classifier.py
Memory trace (1 epoch):
image_classifier.py:263: size=159 MiB, count=3, average=53.1 MiB
/home/ec2-user/.local/lib/python3.7/site-packages/tables/atom.py:1224: size=20.1 MiB, count=3715, average=5675 B
/home/ec2-user/.local/lib/python3.7/site-packages/matplotlib/lines.py:380: size=2597 KiB, count=1205, average=2207 B
/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/util/tf_stack.py:147: size=2546 KiB, count=26034, average=100 B
/home/ec2-user/.local/lib/python3.7/site-packages/matplotlib/transforms.py:179: size=1783 KiB, count=18009, average=101 B
/home/ec2-user/.local/lib/python3.7/site-packages/matplotlib/transforms.py:93: size=1487 KiB, count=16326, average=93 B
<frozen importlib._bootstrap_external>:525: size=1171 KiB, count=10792, average=111 B
/home/ec2-user/.local/lib/python3.7/site-packages/matplotlib/artist.py:75: size=1170 KiB, count=2859, average=419 B
/usr/lib64/python3.7/contextlib.py:82: size=791 KiB, count=5773, average=140 B
/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/util/tf_stack.py:131: size=608 KiB, count=22225, average=28 B
/home/ec2-user/.local/lib/python3.7/site-packages/matplotlib/cbook/__init__.py:795: size=565 KiB, count=61, average=9483 B
/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/util/tf_stack.py:136: size=552 KiB, count=20184, average=28 B
/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py:365: size=498 KiB, count=2520, average=202 B
/usr/lib64/python3.7/abc.py:143: size=462 KiB, count=3773, average=125 B
<__array_function__ internals>:6: size=342 KiB, count=6058, average=58 B
/home/ec2-user/.local/lib/python3.7/site-packages/matplotlib/transforms.py:180: size=317 KiB, count=6156, average=53 B
/home/ec2-user/.local/lib/python3.7/site-packages/numpy/core/_asarray.py:85: size=294 KiB, count=3926, average=77 B
/home/ec2-user/.local/lib/python3.7/site-packages/cycler.py:227: size=278 KiB, count=3253, average=87 B
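The listing above has the format printed by Python's `tracemalloc` module, so the sketch below assumes that is how it was produced. The workload is a stand-in, not the original training code, and the peak-RSS comparison via `resource` is an addition for illustration (Unix only).

```python
import resource
import tracemalloc

tracemalloc.start()

# Stand-in workload: allocate some Python objects so the snapshot is not empty.
data = [bytes(1024) for _ in range(1000)]

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")  # group allocations by file and line

# Each entry prints as "file:line: size=..., count=..., average=..."
for stat in top_stats[:10]:
    print(stat)

current, peak = tracemalloc.get_traced_memory()  # bytes traced by tracemalloc
tracemalloc.stop()

# Peak resident set size of the whole process as seen by the OS.
# ru_maxrss is reported in kilobytes on Linux (bytes on macOS).
max_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"traced peak: {peak / 2**20:.1f} MiB, ru_maxrss: {max_rss}")
```

One caveat that may explain the small numbers: `tracemalloc` only records allocations made through Python's own memory allocators. Buffers that TensorFlow allocates natively are not traced, so the trace can top out at 159 MiB while the kernel sees the process holding far more. Comparing the traced peak against the process RSS is one way to see how much memory falls outside the trace.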