I'm trying to get the hang of reinforcement learning, so I'm following this guide: pytorch.org/tutorials/
They implement a DQN that solves CartPole with computer vision. Basically, I copied their code and modified it to solve the LunarLander environment without computer vision (feeding the state vector directly instead of pixels). But I'm getting strange results. The model does seem to learn, since its score improves (with a lot of hiccups), until it fails badly, gets stuck, and keeps performing weird actions instead of learning.
You can see that both models fail in the same way at the end of training.
I can't figure out why this solution doesn't work. Could you take a look at my code and maybe find and point out the mistake?
Global variables:
import gym
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from itertools import count

BATCH_SIZE = 1000
GAMMA = 0.999
EPS_START = 0.9
EPS_END = 0.05
EPS_DECAY = 1000
TARGET_UPDATE = 10
LEARNING_RATE = 0.01
MOMENTUM = 0.9
MEMORY_SIZE = 10000

all_rewards = []  # per-episode rewards collected in the learning loop

env = gym.make('LunarLander-v2')
n_actions = env.action_space.n
n_observation_space = env.observation_space.shape[0]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

policy_net = DQN(n_observation_space, n_actions).to(device)
target_net = DQN(n_observation_space, n_actions).to(device)
target_net.load_state_dict(policy_net.state_dict())
target_net.eval()

optimizer = optim.Adam(policy_net.parameters(), lr=LEARNING_RATE)
memory = ReplayMemory(MEMORY_SIZE)
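For completeness: Transition and ReplayMemory are essentially copied from the tutorial. The sketch below is roughly what I'm using (same names as in the tutorial):

import random
from collections import namedtuple, deque

# Transition tuple and replay buffer, as in the tutorial
Transition = namedtuple('Transition',
                        ('state', 'action', 'next_state', 'reward'))

class ReplayMemory(object):
    def __init__(self, capacity):
        self.memory = deque([], maxlen=capacity)

    def push(self, *args):
        # Save a transition
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)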
Learning loop:
def learn(num_episodes=50, render=False):
    for i_episode in range(num_episodes):
        # Initialize the environment and state
        state = torch.tensor([env.reset()], device=device, dtype=torch.float32)
        episode_reward = 0
        for t in count():
            # Select and perform an action
            action = select_action(state)
            next_state, reward, done, _ = env.step(action.item())
            episode_reward += reward
            reward = torch.tensor([reward], device=device, dtype=torch.float32)
            next_state = torch.tensor([next_state], device=device, dtype=torch.float32)

            # Store the transition in memory
            memory.push(state, action, next_state, reward)

            # Move to the next state
            state = next_state

            # Perform one step of the optimization (on the policy network)
            optimize_model()
            if render:
                env.render()
            if done:
                break

        all_rewards.append(episode_reward)

        # Update the target network, copying all weights and biases in DQN
        if i_episode % TARGET_UPDATE == 0:
            target_net.load_state_dict(policy_net.state_dict())
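select_action, used in the loop above, is also taken from the tutorial: an epsilon-greedy policy over policy_net with an exponentially decaying epsilon. Roughly (steps_done is a global counter I keep next to the other globals):

import math
import random

steps_done = 0

def select_action(state):
    global steps_done
    sample = random.random()
    # Epsilon decays exponentially from EPS_START towards EPS_END
    eps_threshold = EPS_END + (EPS_START - EPS_END) * \
        math.exp(-1. * steps_done / EPS_DECAY)
    steps_done += 1
    if sample > eps_threshold:
        with torch.no_grad():
            # Greedy action from the policy network
            return policy_net(state).max(1)[1].view(1, 1)
    else:
        # Random action
        return torch.tensor([[random.randrange(n_actions)]],
                            device=device, dtype=torch.long)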
Optimization method:
def optimize_model():
    if len(memory) < BATCH_SIZE:
        return
    transitions = memory.sample(BATCH_SIZE)
    # Convert a batch-array of Transitions into a Transition of batch-arrays
    batch = Transition(*zip(*transitions))

    # Mask of transitions whose next state is not terminal
    non_final_mask = torch.tensor(tuple(map(lambda s: s is not None,
                                            batch.next_state)),
                                  device=device, dtype=torch.bool)
    non_final_next_states = torch.cat([s for s in batch.next_state
                                       if s is not None])
    state_batch = torch.cat(batch.state)
    action_batch = torch.cat(batch.action)
    reward_batch = torch.cat(batch.reward)

    # Q(s_t, a) for the actions that were actually taken
    state_action_values = policy_net(state_batch).gather(1, action_batch)

    # V(s_{t+1}) from the target network; zero for final states
    next_state_values = torch.zeros(BATCH_SIZE, device=device)
    next_state_values[non_final_mask] = target_net(non_final_next_states).max(1)[0].detach()

    # Expected Q values: r + GAMMA * V(s_{t+1})
    expected_state_action_values = (next_state_values * GAMMA) + reward_batch

    criterion = nn.MSELoss()
    loss = criterion(state_action_values, expected_state_action_values.unsqueeze(1))

    # Optimize the model
    optimizer.zero_grad()
    loss.backward()
    for param in policy_net.parameters():
        param.grad.data.clamp_(-1, 1)
    optimizer.step()
Model:
class DQN(nn.Module):
    def __init__(self, input_size, output_size):
        super(DQN, self).__init__()
        self.l1 = nn.Linear(input_size, 512)
        self.l2 = nn.Linear(512, 512)
        self.l3 = nn.Linear(512, 256)
        self.l4 = nn.Linear(256, output_size)

    def forward(self, x):
        x = F.leaky_relu(self.l1(x))
        x = F.leaky_relu(self.l2(x))
        x = F.leaky_relu(self.l3(x))
        return self.l4(x)
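In case it helps with reproducing the behaviour, I start training roughly like this and then plot all_rewards (the episode count is just whatever I happened to use for a given run):

import matplotlib.pyplot as plt

learn(num_episodes=500, render=False)

# Plot per-episode rewards to see the improvement and the collapse
plt.plot(all_rewards)
plt.xlabel('Episode')
plt.ylabel('Total reward')
plt.show()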
If anyone is willing to run my code locally, let me know. I'll clean the code up and share it via GitHub.