I am running a training job on GCP AI Platform for a TensorFlow estimator that uses a mirrored distribution strategy, with --python-version 3.7 and --runtime-version 2.1.
The relevant code snippets are below:
SESS_CONFIG = tf.compat.v1.ConfigProto(
    allow_soft_placement=True,
    log_device_placement=False,
    intra_op_parallelism_threads=0,
    gpu_options=tf.compat.v1.GPUOptions(force_gpu_compatible=True))
config = tf.estimator.RunConfig(
    save_summary_steps=10,
    save_checkpoints_steps=20,
    session_config=SESS_CONFIG,
    keep_checkpoint_max=5,
    log_step_count_steps=100,
    train_distribute=tf.distribute.MirroredStrategy(),  # Distribution Strategy
    eval_distribute=tf.distribute.MirroredStrategy(),  # Distribution Strategy
    experimental_max_worker_delay_secs=None)
# -----------
custom_estimator_model = tf.estimator.Estimator(
    model_fn=model_fn(), model_dir=model_dir,
    config=config)
train_spec = tf.estimator.TrainSpec(input_fn=input_fn,
                                    max_steps=train_steps)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn,
                                  steps=eval_steps,
                                  exporters=exporters,
                                  throttle_secs=eval_throttle_secs)
tf.estimator.train_and_evaluate(custom_estimator_model,
                                train_spec,
                                eval_spec)
Configuration (the config.yaml in use):
trainingInput:
  masterType: complex_model_m_gpu
  scaleTier: CUSTOM
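This code was running on AI Platform with TensorFlow 1.14 and Python 3.5, with the strategy provided in RunConfig() as train_distribute=tf.contrib.distribute.MirroredStrategy(). After the TF2 upgrade, it was changed to train_distribute=tf.distribute.MirroredStrategy(). After this change, the job fails with the following error: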
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 239, in <module>
    main()
  File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 235, in main
    model_dir=model_dir)
  File "/root/.local/lib/python3.7/site-packages/trainer/models/models.py", line 244, in train_from_scratch
    self.train_estimator(model_dir)
  File "/root/.local/lib/python3.7/site-packages/trainer/models/models.py", line 234, in train_estimator
    eval_spec)
  File "/root/.local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 463, in train_and_evaluate
    _TrainingExecutor)
  File "/root/.local/lib/python3.7/site-packages/tensorflow/python/distribute/estimator_training.py", line 290, in train_and_evaluate
    session_config=run_config.session_config)
  File "/root/.local/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_coordinator.py", line 836, in run_distribute_coordinator
    task_type, task_id)
  File "/root/.local/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_coordinator.py", line 548, in _configure_session_config_for_std_servers
    task_id=task_id)
  File "/root/.local/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1127, in configure
    session_config, cluster_spec, task_type, task_id)
  File "/root/.local/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 788, in _configure
    self._initialize_multi_worker(multi_worker_devices)
  File "/root/.local/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 510, in _initialize_multi_worker
    device_dict = _group_device_list(devices)
  File "/root/.local/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 265, in _group_device_list
    assert not _is_device_list_single_worker(devices)
AssertionError
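
The traceback passes through run_distribute_coordinator and _initialize_multi_worker, which suggests the strategy is being configured from a cluster spec even though config.yaml requests only a single master. As a diagnostic aid, a minimal sketch like the one below could be called at the start of the trainer to log what the TF_CONFIG environment variable contains. The helper name describe_tf_config is hypothetical, and I am assuming TF_CONFIG is a JSON string with "cluster" and "task" keys; this only inspects the environment and is not a fix.

```python
import json
import os


def describe_tf_config(raw=None):
    """Summarize the cluster described by a TF_CONFIG JSON string.

    Hypothetical helper: reads TF_CONFIG from the environment when no
    string is passed in, and assumes the usual "cluster"/"task" layout.
    """
    if raw is None:
        raw = os.environ.get("TF_CONFIG", "")
    if not raw:
        # TF_CONFIG is unset or empty, so there is no cluster to report.
        return {"cluster": {}, "task": {}, "num_tasks": 0}
    parsed = json.loads(raw)
    cluster = parsed.get("cluster", {})
    task = parsed.get("task", {})
    # Count every address listed under every job name (master, worker, ps, ...).
    num_tasks = sum(len(addresses) for addresses in cluster.values())
    return {"cluster": cluster, "task": task, "num_tasks": num_tasks}


if __name__ == "__main__":
    # Print the summary so it shows up in the job logs.
    print(describe_tf_config())
```

If num_tasks comes back as 1 while a non-empty cluster is present, that would be consistent with the single-worker device list that the assertion rejects, though I have not confirmed that this is the cause.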