I'm trying to get the hang of reinforcement learning, so I'm following this guide: pytorch.org/tutorials/
They implement a DQN that solves CartPole with computer vision. Basically, I copied their code and modified it to solve the LunarLander environment without computer vision (feeding the state vector directly instead of pixels). But I'm getting strange results. The model does seem to learn, since its score improves (with a lot of hiccups), until it fails badly, gets stuck, and keeps performing weird actions instead of learning.
You can see that both models fail in the same way at the end of training.
I can't figure out why this solution doesn't work. Could you take a look at my code and maybe find and point out the mistake?
Global variables:
import gym
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from itertools import count

BATCH_SIZE = 1000
GAMMA = 0.999
EPS_START = 0.9
EPS_END = 0.05
EPS_DECAY = 1000
TARGET_UPDATE = 10
LEARNING_RATE = 0.01
MOMENTUM = 0.9
MEMORY_SIZE = 10000

all_rewards = []  # per-episode rewards collected in the learning loop

env = gym.make('LunarLander-v2')
n_actions = env.action_space.n
n_observation_space = env.observation_space.shape[0]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

policy_net = DQN(n_observation_space, n_actions).to(device)
target_net = DQN(n_observation_space, n_actions).to(device)
target_net.load_state_dict(policy_net.state_dict())
target_net.eval()

optimizer = optim.Adam(policy_net.parameters(), lr=LEARNING_RATE)
memory = ReplayMemory(MEMORY_SIZE)
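For completeness: Transition and ReplayMemory are essentially copied from the tutorial. The sketch below is roughly what I'm using (same names as in the tutorial):

import random
from collections import namedtuple, deque

# Transition tuple and replay buffer, as in the tutorial
Transition = namedtuple('Transition',
                        ('state', 'action', 'next_state', 'reward'))

class ReplayMemory(object):
    def __init__(self, capacity):
        self.memory = deque([], maxlen=capacity)

    def push(self, *args):
        # Save a transition
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)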
Learning loop:
def learn(num_episodes=50, render=False):
    for i_episode in range(num_episodes):
        # Initialize the environment and state
        state = torch.tensor([env.reset()], device=device, dtype=torch.float32)
        episode_reward = 0
        for t in count():
            # Select and perform an action
            action = select_action(state)
            next_state, reward, done, _ = env.step(action.item())
            episode_reward += reward
            reward = torch.tensor([reward], device=device, dtype=torch.float32)
            next_state = torch.tensor([next_state], device=device, dtype=torch.float32)

            # Store the transition in memory
            memory.push(state, action, next_state, reward)

            # Move to the next state
            state = next_state

            # Perform one step of the optimization (on the policy network)
            optimize_model()
            if render:
                env.render()
            if done:
                break

        all_rewards.append(episode_reward)

        # Update the target network, copying all weights and biases in DQN
        if i_episode % TARGET_UPDATE == 0:
            target_net.load_state_dict(policy_net.state_dict())
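select_action, used in the loop above, is also taken from the tutorial: an epsilon-greedy policy over policy_net with an exponentially decaying epsilon. Roughly (steps_done is a global counter I keep next to the other globals):

import math
import random

steps_done = 0

def select_action(state):
    global steps_done
    sample = random.random()
    # Epsilon decays exponentially from EPS_START towards EPS_END
    eps_threshold = EPS_END + (EPS_START - EPS_END) * \
        math.exp(-1. * steps_done / EPS_DECAY)
    steps_done += 1
    if sample > eps_threshold:
        with torch.no_grad():
            # Greedy action from the policy network
            return policy_net(state).max(1)[1].view(1, 1)
    else:
        # Random action
        return torch.tensor([[random.randrange(n_actions)]],
                            device=device, dtype=torch.long)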
Optimization method:
def optimize_model():
    if len(memory) < BATCH_SIZE:
        return
    transitions = memory.sample(BATCH_SIZE)
    # Convert a batch-array of Transitions into a Transition of batch-arrays
    batch = Transition(*zip(*transitions))

    # Mask of transitions whose next state is not terminal
    non_final_mask = torch.tensor(tuple(map(lambda s: s is not None,
                                            batch.next_state)),
                                  device=device, dtype=torch.bool)
    non_final_next_states = torch.cat([s for s in batch.next_state
                                       if s is not None])
    state_batch = torch.cat(batch.state)
    action_batch = torch.cat(batch.action)
    reward_batch = torch.cat(batch.reward)

    # Q(s_t, a) for the actions that were actually taken
    state_action_values = policy_net(state_batch).gather(1, action_batch)

    # V(s_{t+1}) from the target network; zero for final states
    next_state_values = torch.zeros(BATCH_SIZE, device=device)
    next_state_values[non_final_mask] = target_net(non_final_next_states).max(1)[0].detach()

    # Expected Q values: r + GAMMA * V(s_{t+1})
    expected_state_action_values = (next_state_values * GAMMA) + reward_batch

    criterion = nn.MSELoss()
    loss = criterion(state_action_values, expected_state_action_values.unsqueeze(1))

    # Optimize the model
    optimizer.zero_grad()
    loss.backward()
    for param in policy_net.parameters():
        param.grad.data.clamp_(-1, 1)
    optimizer.step()
Model:
class DQN(nn.Module):
    def __init__(self, input_size, output_size):
        super(DQN, self).__init__()
        self.l1 = nn.Linear(input_size, 512)
        self.l2 = nn.Linear(512, 512)
        self.l3 = nn.Linear(512, 256)
        self.l4 = nn.Linear(256, output_size)

    def forward(self, x):
        x = F.leaky_relu(self.l1(x))
        x = F.leaky_relu(self.l2(x))
        x = F.leaky_relu(self.l3(x))
        return self.l4(x)
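In case it helps with reproducing the behaviour, I start training roughly like this and then plot all_rewards (the episode count is just whatever I happened to use for a given run):

import matplotlib.pyplot as plt

learn(num_episodes=500, render=False)

# Plot per-episode rewards to see the improvement and the collapse
plt.plot(all_rewards)
plt.xlabel('Episode')
plt.ylabel('Total reward')
plt.show()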
If anyone is willing to run my code locally, let me know. I'll clean the code up and share it via GitHub.