I am training DQN from Ray RLlib in my custom simulator, and it usually produces good results after 15 million steps.
After playing around with DQN for a while, I am now trying to train A2C in the same simulator. However, as the figure below shows, it is nowhere near converging. In my simulator a return of roughly -50 is considered the maximum, and DQN reaches it within at most 15 million steps.
The simulator is exactly the same for DQN and A2C (a minimal sketch of its interface follows this list):
- 71 discrete observations
- 3 discrete actions
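For concreteness, here is a minimal sketch of what such an environment's interface looks like. It assumes the 71 discrete observations form a single `Discrete(71)` space (they could just as well be a `MultiDiscrete`); `SimulatorEnv` and its placeholder dynamics are made up, since the real simulator internals are not shown in the question:

```python
import gym
from gym import spaces


class SimulatorEnv(gym.Env):
    """Hypothetical stand-in for the custom simulator described above."""

    def __init__(self, env_config=None):
        self.observation_space = spaces.Discrete(71)  # 71 discrete observations
        self.action_space = spaces.Discrete(3)        # 3 discrete actions
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Placeholder dynamics: the real simulator logic goes here. This stub
        # just emits a constant penalty and never terminates.
        reward = -1.0
        done = False
        return self.state, reward, done, {}
```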
I don't believe the environment needs to change for either algorithm, but maybe I'm wrong…
Can anyone think of a reason why A2C is not learning in my simulator?
A2C parameters:
(the same as the default configuration in Ray RLlib)
# Should use a critic as a baseline (otherwise don't use value baseline;
# required for using GAE).
"use_critic": True,
# If true, use the Generalized Advantage Estimator (GAE)
# with a value function, see https://arxiv.org/pdf/1506.02438.pdf.
"use_gae": True,
# Size of rollout batch
"rollout_fragment_length": 20,
# GAE (lambda) parameter
"lambda": 1.0,
# Max global norm for each gradient calculated by worker
"grad_clip": 40.0,
# Learning rate
"lr": 0.0001,
# Learning rate schedule
"lr_schedule": None,
# Value Function Loss coefficient
"vf_loss_coeff": 0.5,
# Entropy coefficient
"entropy_coeff": 0.01,
# Min time per iteration
"min_iter_time_s": 10,
# Workers sample async. Note that this increases the effective
# rollout_fragment_length by up to 5x due to async buffering of batches.
"sample_async": False,
# Switch on Trajectory View API for A2/3C by default.
# NOTE: Only supported for PyTorch so far.
"_use_trajectory_view_api": True,
# A2C supports microbatching, in which we accumulate gradients over
# batch of this size until the train batch size is reached. This allows
# training with batch sizes much larger than can fit in GPU memory.
# To enable, set this to a value less than the train batch size.
"microbatch_size": None