
I am trying to train an LSTM while keeping its hidden state between updates (a stateful LSTM) until I start a new episode. An interesting situation comes up here, because when I try to do this I get the following error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [705, 25]] is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
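
For what it's worth, my understanding of this class of error, as a reduced standalone example (not my actual code): a tensor that backward() still needs is changed in place, so its version counter no longer matches the version autograd saved:

import torch

w = torch.ones(3, requires_grad=True)
x = w * 2            # non-leaf tensor; autograd saves x for the pow backward
y = (x ** 2).sum()
x += 1               # in-place change bumps x's version counter
y.backward()         # RuntimeError: ... modified by an inplace operation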

This only happens when I try to keep the hidden state. If, for example, I use the single line

dist, _ = self.actor(state, h_out)

instead (discarding the returned hidden state) and remove retain_graph=True, everything works fine.
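
Here is a small standalone toy I put together with a plain nn.LSTM (not my actual actor/critic) that I believe runs into the same pattern: the hidden state carried over from the previous iteration still references the old graph, while optimizer.step() has already modified the weights in place:

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
opt = torch.optim.SGD(lstm.parameters(), lr=0.1)

hidden = None
for step in range(2):
    x = torch.randn(1, 5, 4)
    out, hidden = lstm(x, hidden)      # hidden still points into the previous graph
    loss = out.pow(2).mean()
    opt.zero_grad()
    loss.backward(retain_graph=True)   # fails on the second iteration
    opt.step()                         # in-place update of the weights
    # hidden = tuple(h.detach() for h in hidden)  # <- uncommenting this avoids the error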

Can anyone help me understand what is going on here and how I can fix it?

Here is my training loop:

for ep in range(conf.num_episode):
    state = env.reset()
    step = 0

    # reset the stateful hidden states at the start of each episode
    qnet_agent.hidden = None
    qnet_agent.hidden_2 = None
    while True:
        step += 1
        frames_total += 1

        epsilon = calculate_epsilon(frames_total)

        action, smart_decision = qnet_agent.select_action(state, epsilon)

        new_state, reward, done, info = env.step(action)

        memory.push(state, action, new_state, reward, done)

        qnet_agent.optimize()
        state = new_state

        if done:
            steps_total.append(step)
            break

And here is my optimize function:

def optimize(self):
    if len(self.memory) < self.config.batch_size:
        return

    state, action, new_state, reward, done = self.memory.sample(batch_size=self.config.batch_size)

    state = torch.Tensor(np.array(state)).to(device)
    new_state = torch.Tensor(np.array(new_state)).to(device)
    reward = torch.Tensor(reward).to(device)
    action = torch.LongTensor(action).to(device)
    done = torch.Tensor(done).to(device)

    # reuse the hidden state kept from the previous optimize() call (the "stateful" part)
    h_out = self.hidden
    dist, self.hidden = self.actor(state, h_out)
    dist = torch.distributions.Categorical(dist)

    advantage = reward + (1 - done) * self.config.gamma * self.critic(new_state).squeeze(1) - self.critic(state).squeeze(1)

    critic_loss = advantage.pow(2).mean()
    self.optimizer_critic.zero_grad()
    critic_loss.backward()
    self.optimizer_critic.step()

    actor_loss = -dist.log_prob(action) * advantage.detach()
    self.optimizer_actor.zero_grad()
    actor_loss.mean().backward(retain_graph=True)
    self.optimizer_actor.step()
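
Is detaching the stored hidden state from the old graph between updates, roughly as sketched below, the intended way to keep an LSTM stateful across optimize() calls, or does that throw away exactly the cross-step gradients I want? (The helper name below is just mine for illustration.)

def _detach_hidden(hidden):
    # hidden is either a single tensor (e.g. GRU) or an (h, c) tuple (LSTM)
    if hidden is None:
        return None
    if isinstance(hidden, tuple):
        return tuple(h.detach() for h in hidden)
    return hidden.detach()

# inside optimize(), before the forward pass:
# h_out = _detach_hidden(self.hidden)
# dist, self.hidden = self.actor(state, h_out)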