While searching for the best hyperparameters for my AlexNet with Talos, I get an out-of-memory error. It always occurs in the same round of the scan (32/240), even if I change the parameters slightly (which rules out an unfavorable parameter combination as the cause).
Error message:
ResourceExhaustedError: OOM when allocating tensor with shape[32,96,26,26] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node max_pooling2d_1/MaxPool (defined at D:\anaconda\envs\tf_ks\lib\site-packages\keras\backend\tensorflow_backend.py:3009) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[Op:__inference_keras_scratch_graph_246047]
Function call stack:
keras_scratch_graph
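For scale: the tensor the allocator fails on is small by itself, which makes me think memory is accumulating across rounds rather than one allocation being oversized. A quick check of my own (shape [32,96,26,26] is the first pooling layer's output for batch size 32):

# Size of the failing allocation: 32 x 96 feature maps of 26x26 float32
print(32 * 96 * 26 * 26 * 4 / 2**20, 'MiB')  # ~7.9 MiB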
Here is my code:
Session configuration:
import tensorflow as tf
from keras import backend as K

config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand
config.gpu_options.per_process_gpu_memory_fraction = 0.99
sess = tf.compat.v1.Session(config=config)
K.set_session(sess)
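Since this runs on TensorFlow 2.1, the same effect can also be had without the compat.v1 session; a minimal sketch of the TF 2.x-native setup (this is an alternative I considered, not what I currently run):

import tensorflow as tf

# Enable on-demand GPU memory growth the TF 2.x way; this must run
# before any op initializes the GPUs.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)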
AlexNet configuration and fitting:
def alexnet(x_train, y_train, x_val, y_val, params):
    K.clear_session()
    if params['activation'] == 'leakyrelu':
        activation_layer = LeakyReLU(alpha=params['leaky_alpha'])
    elif params['activation'] == 'relu':
        activation_layer = ReLU()
    model = Sequential([
        Conv2D(filters=96, kernel_size=(11,11), strides=(4,4), activation='relu', input_shape=(224,224,Global.num_image_channels)),
        BatchNormalization(),
        MaxPooling2D(pool_size=(3,3), strides=(2,2)),
        Conv2D(filters=256, kernel_size=(5,5), strides=(1,1), activation='relu', padding="same"),
        BatchNormalization(),
        MaxPooling2D(pool_size=(3,3), strides=(2,2)),
        Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), activation='relu', padding="same"),
        BatchNormalization(),
        Conv2D(filters=384, kernel_size=(1,1), strides=(1,1), activation='relu', padding="same"),
        BatchNormalization(),
        Conv2D(filters=256, kernel_size=(1,1), strides=(1,1), activation='relu', padding="same"),
        BatchNormalization(),
        MaxPooling2D(pool_size=(3,3), strides=(2,2)),
        Flatten(),
        Dense(4096, activation=activation_layer),
        Dropout(0.5),  # todo: use params['dropout']
        Dense(4096, activation=activation_layer),
        Dropout(0.5),  # todo: use params['dropout']
        Dense(units=2, activation=activation_layer)
        #Dense(10, activation='softmax')
    ])
    model.compile(
        optimizer=params['optimizer'](lr=lr_normalizer(params['lr'], params['optimizer'])),
        loss=Global.loss_funktion,
        metrics=get_reduction_metric(Global.reduction_metric)
    )
    train_generator, valid_generator = create_data_pipline(params['batch_size'], params['samples'])
    tg_steps_per_epoch = train_generator.n // train_generator.batch_size
    vg_validation_steps = valid_generator.n // valid_generator.batch_size
    print('Steps per Epoch: {}, Validation Steps: {}'.format(tg_steps_per_epoch, vg_validation_steps))
    startTime = datetime.now()
    out = model.fit(
        x=train_generator,
        epochs=params['epochs'],
        validation_data=valid_generator,
        steps_per_epoch=tg_steps_per_epoch,
        validation_steps=vg_validation_steps,
        #callbacks=[checkpointer]
        workers=8
    )
    print("Time taken:", datetime.now() - startTime)
    return out, model
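A side note on my own code: 'dropout', 'first_neuron' and 'hidden_layers' appear in the grid below but are still hardcoded in the model (see the two todo marks), so those permutations all build essentially the same graph. A minimal sketch of how they could be wired into the dense head (using first_neuron for every hidden layer is my assumption about the intent):

# Hypothetical parameterized head, replacing the hardcoded 4096/0.5 block:
head = [Flatten(),
        Dense(params['first_neuron'], activation=activation_layer),
        Dropout(params['dropout'])]
for _ in range(params['hidden_layers']):
    head += [Dense(params['first_neuron'], activation=activation_layer),
             Dropout(params['dropout'])]
head.append(Dense(units=2, activation=activation_layer))  # or 'softmax'
# model = Sequential(conv_blocks + head)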
Hyperparameter list:
hyper_parameter = {
    'samples': [20000],
    'epochs': [1],
    'batch_size': [32, 64],
    'optimizer': [Adam],
    'lr': [1, 2],
    'first_neuron': [1024, 2048, 4096],
    'dropout': [0.25, 0.50],
    'activation': ['leakyrelu', 'relu'],
    'hidden_layers': [0, 1, 2, 3, 4],
    'leaky_alpha': [0.1]  # default for LeakyReLU; otherwise PReLU
}
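For reference, this grid yields exactly the 240 permutations mentioned above (2 batch sizes x 2 learning rates x 3 first_neuron values x 2 dropouts x 2 activations x 5 hidden_layer counts); a quick check:

from functools import reduce
n_rounds = reduce(lambda acc, v: acc * len(v), hyper_parameter.values(), 1)
print(n_rounds)  # 240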
Running Talos:
dummy_x = np.empty((1, 2, 3, 224, 224))
dummy_y = np.empty((1, 2))

with tf.device('/device:GPU:0'):
    t = ta.Scan(
        x=dummy_x,
        y=dummy_y,
        model=alexnet,
        params=hyper_parameter,
        experiment_name='{}'.format(Global.dataset),
        #shuffle=False,
        reduction_metric=Global.reduction_metric,
        disable_progress_bar=False,
        print_params=True,
        clear_session='tf',
        save_weights=False
    )
t.data.to_csv(Global.target_dir + Global.results, index=True)
GPU memory usage is always high, but it does not climb over time; it only fluctuates slightly.
nvidia-smi output: (screenshot not included here)
Can anyone help me here?
============================================================================
What I have already tried:
1) Splitting the Talos run:
This led to the same error.
hyper_parameter = {
    'samples': [20000],
    'epochs': [1],
    'batch_size': [32, 64],
    'optimizer': [Adam],
    'lr': [1, 2, 3, 5],
    'first_neuron': [9999],   # placeholder, overwritten in the loop below
    'dropout': [0.25, 0.50],
    'activation': ['leakyrelu', 'relu'],
    'hidden_layers': [9999],  # placeholder, overwritten in the loop below
    'leaky_alpha': [0.1]      # default for LeakyReLU; otherwise PReLU
}
dummy_x = np.empty((1, 2, 3, 224, 224))
dummy_y = np.empty((1, 2))

first = True
for h in [0, 1, 2, 3, 4]:
    hyper_parameter['hidden_layers'] = [h]
    for fn in [1024, 2048, 4096]:
        hyper_parameter['first_neuron'] = [fn]
        with tf.device('/device:GPU:1'):
            t = ta.Scan(
                x=dummy_x,
                y=dummy_y,
                model=alexnet,
                params=hyper_parameter,
                experiment_name='{}'.format(Global.dataset),
                #shuffle=False,
                reduction_metric=Global.reduction_metric,
                disable_progress_bar=False,
                print_params=True,
                clear_session='tf',
                save_weights=False
            )
        if first:
            t.data.to_csv(Global.target_dir + Global.results, index=True, mode='a')
            first = False
        else:
            t.data.to_csv(Global.target_dir + Global.results, index=True, mode='a', header=False)
============================================================================
2) Running the model in its own thread:
While searching for the cause, I found others complaining about the same problem and blaming TensorFlow for not properly executing K.clear_session().
Maybe the idea is silly, but I tried training the model in an extra thread.
from threading import Thread

def gen_model_thread(x_train, y_train, x_val, y_val, params):
    thread = Thread(target=alexnet, args=(x_train, y_train, x_val, y_val, params))
    thread.start()
    return_value = thread.join()  # note: Thread.join() always returns None
    return return_value
with tf.device('/device:GPU:0'):
    t = ta.Scan(
        x=dummy_x,
        y=dummy_y,
        model=gen_model_thread,
        params=hyper_parameter,
        experiment_name='{}'.format(Global.dataset),
        #shuffle=False,
        reduction_metric=Global.reduction_metric,
        disable_progress_bar=False,
        print_params=True,
        clear_session=True,
        save_weights=False
    )
This led to a type error:
Traceback (most recent call last):
File "D:\anaconda\envs\tf_ks\lib\threading.py", line 926, in _bootstrap_inner
self.run()
File "D:\anaconda\envs\tf_ks\lib\threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "<ipython-input-3-2942ae0a0a56>", line 5, in gen_model
model = alexnet(params['activation'], params['leaky_alpha'])
File "<ipython-input-2-2a405202aa5a>", line 27, in alexnet
Dense(units = 2, activation=activation_layer)
File "D:\anaconda\envs\tf_ks\lib\site-packages\keras\engine\sequential.py", line 94, in __init__
self.add(layer)
File "D:\anaconda\envs\tf_ks\lib\site-packages\keras\engine\sequential.py", line 162, in add
name=layer.name + '_input')
File "D:\anaconda\envs\tf_ks\lib\site-packages\keras\engine\input_layer.py", line 178, in Input
input_tensor=tensor)
File "D:\anaconda\envs\tf_ks\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "D:\anaconda\envs\tf_ks\lib\site-packages\keras\engine\input_layer.py", line 87, in __init__
name=self.name)
File "D:\anaconda\envs\tf_ks\lib\site-packages\keras\backend\tensorflow_backend.py", line 73, in symbolic_fn_wrapper
if _SYMBOLIC_SCOPE.value:
AttributeError: '_thread._local' object has no attribute 'value'
TypeError: cannot unpack non-iterable NoneType object
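Two separate things seem to go wrong here. The AttributeError comes from Keras 2.3 keeping its symbolic scope in thread-local storage, so a model built in a worker thread does not see the backend state initialized in the main thread. The TypeError comes from gen_model_thread itself: Thread.join() always returns None, so Talos receives None instead of the (out, model) tuple and fails to unpack it. And even if the thread worked, it would not free GPU memory, since all threads share the process's CUDA context. A child process, by contrast, does release its GPU memory on exit, so one workaround I am considering is running each scan chunk in its own process. A minimal sketch, assuming alexnet and hyper_parameter live in an importable module (function and file names here are mine, not Talos API):

import multiprocessing as mp

def run_scan_chunk(params, csv_path, write_header):
    # Import TF/Talos inside the child so each process gets a fresh CUDA context.
    import numpy as np
    import talos as ta
    dummy_x = np.empty((1, 2, 3, 224, 224))
    dummy_y = np.empty((1, 2))
    t = ta.Scan(x=dummy_x, y=dummy_y, model=alexnet, params=params,
                experiment_name='oom_debug', save_weights=False)  # hypothetical name
    t.data.to_csv(csv_path, index=True, mode='a', header=write_header)

if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)  # fresh interpreter per child
    first = True
    for h in [0, 1, 2, 3, 4]:  # splitting over hidden_layers, as in attempt 1
        chunk = dict(hyper_parameter, hidden_layers=[h])
        p = mp.Process(target=run_scan_chunk, args=(chunk, 'results.csv', first))
        p.start()
        p.join()  # all GPU memory is returned when the child exits
        first = False

Note that with the spawn start method this has to live in a regular .py script rather than a notebook cell, because each child re-imports the main module.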
I know my last resort would be to do the search manually, but I suspect I would run into the same problem later when training my final model.
Thank you very much for taking on my problem, reading my question, and correcting the typos in my text ^^.
I look forward to constructive solutions from this amazing community! (:
============================================================================
GPU: NVIDIA RTX 2080 Ti and Titan Xp Collector's Edition (I tried both)
TensorFlow: 2.1.0
Keras: 2.3.1
Talos: 1.0