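I am trying to run distributed training with tensorflow.keras and Horovod on an AWS p2.8xlarge, using nvidia-docker and a custom training loop (train_on_batch). My code is messy, so posting all of it would not be very useful. Here is a link to the output, which is not very informative to me. The code runs without errors when launched with python run_trn.py. Any suggestions on how to probe this error?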
Relevant Horovod code:
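import tensorflow as tf
import horovod.tensorflow.keras as hvd

# hvd.init() has to be called before hvd.local_rank() is used (not shown here).
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    # Pin each process to the single GPU that matches its local rank.
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')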
...
if self.horovod:
    lr_factor = hvd.size()
else:
    lr_factor = 1
optimizer = Adam(
    beta_1=self.adam_optimizer['beta1'],
    beta_2=self.adam_optimizer['beta2'],
    lr=self.learning_rate['initial_value'] * lr_factor,
    epsilon=self.adam_optimizer['epsilon'],
)
if self.horovod:
    optimizer = hvd.DistributedOptimizer(optimizer)
...
if self.horovod:
    gcb = hvd.callbacks.BroadcastGlobalVariablesCallback(0)
    dcb = hvd.callbacks.BroadcastGlobalVariablesCallback(0)
    fecb = hvd.callbacks.BroadcastGlobalVariablesCallback(0)
    ccb = hvd.callbacks.BroadcastGlobalVariablesCallback(0)
    gcb.set_model(self.generator)
    dcb.set_model(self.discriminator)
    fecb.set_model(self.feature_extractor)
    ccb.set_model(self.model)
    gcb.on_train_begin()
    dcb.on_train_begin()
    fecb.on_train_begin()
    ccb.on_train_begin()
...
if self.horovod:
    dataset = dataset.shard(hvd.size(), hvd.rank())
...
if hvd.rank() == 0:
    self.helper.save_batch(epoch, step, batch, self.generator.model.predict,
                           flatness, preview_path='sample_batches')
...
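To make the intent of the sharding and learning-rate scaling above concrete, here is a minimal pure-Python sketch that needs neither Horovod nor TensorFlow. It assumes that dataset.shard(n, i) keeps every n-th element starting at position i, as the tf.data documentation describes. The helpers `shard` and `scaled_learning_rate`, the 20-batch dataset, and the base learning rate of 1e-4 are hypothetical and only for illustration; they are not taken from my actual code.

```python
def shard(elements, num_shards, index):
    """Keep the elements whose position satisfies position % num_shards == index."""
    return [x for pos, x in enumerate(elements) if pos % num_shards == index]


def scaled_learning_rate(base_lr, world_size):
    """Linear scaling, mirroring `initial_value * hvd.size()` in the snippet above."""
    return base_lr * world_size


world_size = 8             # matches `horovodrun -np 8`
batches = list(range(20))  # hypothetical dataset of 20 batches

# One shard per rank, in the same way each process would call shard(size, rank).
per_rank = [shard(batches, world_size, rank) for rank in range(world_size)]
steps_per_rank = [len(part) for part in per_rank]

for rank, part in enumerate(per_rank):
    print(f"rank {rank}: {part}")
print("steps per rank:", steps_per_rank)
print("scaled lr:", scaled_learning_rate(1e-4, world_size))
```

With 20 batches and 8 ranks the shards are not all the same size in this sketch, so the ranks would not all run the same number of train_on_batch steps. I have not checked whether that applies to my real dataset.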
These are my commands:
nvidia-docker run --privileged --gpus all -v $PWD:/opt/project -it horovod/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1 /bin/bash
cd /opt/project
python -m pip install --upgrade pip
pip install joblib scikit-image tqdm
horovodrun -np 8 -H localhost:8 python run_trn.py
At some point, the GPUs are being used:
ip-172-31-1-23 Wed May 12 00:26:19 2021 450.80.02
[0] Tesla K80 | 32'C, 0 % | 8447 / 11441 MB | root(8444M)
[1] Tesla K80 | 34'C, 0 % | 8447 / 11441 MB | root(8444M)
[2] Tesla K80 | 29'C, 0 % | 8447 / 11441 MB | root(8444M)
[3] Tesla K80 | 34'C, 0 % | 8447 / 11441 MB | root(8444M)
[4] Tesla K80 | 31'C, 0 % | 8447 / 11441 MB | root(8444M)
[5] Tesla K80 | 34'C, 0 % | 8447 / 11441 MB | root(8444M)
[6] Tesla K80 | 37'C, 0 % | 8447 / 11441 MB | root(8444M)
[7] Tesla K80 | 38'C, 0 % | 8447 / 11441 MB | root(8444M)