I am using the Ray RLlib library to run sagemaker_rl on a SageMaker instance with an 8-core CPU, and I set num_workers to 7.
After a long execution, I get the error `The actor died unexpectedly before finishing this task`.

    class MyLauncher(SageMakerRayLauncher):
        def register_env_creator(self):
            register_env(
                "RiveRL-v1",
                lambda env_config: create_env(env_config),
            )

        def get_experiment_config(self):
            return {
                "training": {
                    "env": "RiveRL-v1",
                    "run": "PPO",
                    "config": {
                        "ignore_worker_failures": True,
                        "gamma": 0.6,
                        "num_sgd_iter": 5,
                        "lr": 0.0001,
                        "sgd_minibatch_size": 32768,
                        "train_batch_size": 100000,
                        "use_gae": False,
                        "num_workers": (self.num_cpus - 1),
                        "num_gpus": self.num_gpus,
                        "batch_mode": "complete_episodes",
                        "env_config": {
                            "window_size": 25,
                            "max_allowed_loss": 0.2
                        },
                        "observation_filter": "MeanStdFilter",
                        "entropy_coeff": 0.01,
                    },
                    "checkpoint_freq": 2,
                }
            }

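For clarity, here is how the worker count in the config above resolves on this instance. This is only a minimal sketch: `resolve_num_workers` is a hypothetical helper (not part of Ray or sagemaker_rl), and it assumes `self.num_cpus` reports all 8 cores, which I have not verified.

```python
def resolve_num_workers(num_cpus: int) -> int:
    """Mirror the config expression `self.num_cpus - 1` (hypothetical helper)."""
    # Every core except one becomes a rollout worker.
    return num_cpus - 1


# Assumption: the launcher sees all 8 cores of the instance.
num_workers = resolve_num_workers(8)
print(num_workers)
```

So the failing runs use 7 rollout workers, while the run that works uses a single one.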

    Failure #1 (occurred at 2021-10-20_18-35-15)
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/ray/tune/trial_runner.py", line 467, in _process_trial
        result = self.trial_executor.fetch_result(trial)
      File "/usr/local/lib/python3.6/dist-packages/ray/tune/ray_trial_executor.py", line 431, in fetch_result
        result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
      File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 1517, in get
        raise value
    ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

However, whenever I change num_workers to 1, the problem goes away. Any idea how to solve this?