I am new to PyTorch and reinforcement learning, so apologies if this question sounds silly or the solution is trivial, but I don't know how to solve this. I have spent several days researching it and trying to find a fix, without success. If any of you could help me, or at least give me some advice, I would be grateful.

I am trying to build a model that buys and sells stocks on the market; it will have only 2 possible actions, BUY and SELL. I am implementing an Actor-Critic setup where the Actor uses 2 GRUs connected to each other, and the Critic uses just a few simple linear layers, because I want to see how well this does compared to my plain model.


import torch
import torch.nn as nn


class ActorNN(nn.Module):
    def __init__(self, stock_env: StockEnv, conf: wandb):
        super(ActorNN, self).__init__()
        self.stock_env = stock_env
        self.input_size = 20
        self.hidden_size_1 = 235
        self.hidden_size_2 = 135
        self.num_layers_1 = 2
        self.num_layers_2 = 3
        self.batch_size = 350

        output_size = self.stock_env.action_space.n  # 2 actions: BUY, SELL
        # both recurrent layers are GRUs with the default batch_first=False,
        # so they expect input of shape (seq_len, batch, features)
        self.lstm = nn.GRU(input_size=self.input_size, hidden_size=self.hidden_size_1,
                           num_layers=self.num_layers_1, dropout=conf.dropout_1)
        self.lstm_2 = nn.GRU(input_size=self.hidden_size_1, hidden_size=self.hidden_size_2,
                             num_layers=self.num_layers_2, dropout=conf.dropout_2)
        self.output_layer = nn.Linear(self.hidden_size_2, output_size)
        self.activation = nn.Tanh()

    def forward(self, x):
        x = self.activation(x.view(len(x), -1, self.input_size))
        out, new_hidden_1 = self.lstm(x)
        # new_hidden_1 has shape (num_layers_1, batch, hidden_size_1),
        # i.e. the layer dimension (not the sequence output) is fed into lstm_2
        out = self.activation(new_hidden_1)
        out, _ = self.lstm_2(out)
        out = self.activation(out)
        out = self.output_layer(out)
        return out
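
To understand the shapes, I wrote a small standalone probe of the two stacked GRUs (the sizes are copied from my config, the dropout values are just placeholders, and random tensors stand in for my real batch):


import torch
import torch.nn as nn

gru_1 = nn.GRU(input_size=20, hidden_size=235, num_layers=2, dropout=0.5)
gru_2 = nn.GRU(input_size=235, hidden_size=135, num_layers=3, dropout=0.5)

x = torch.randn(350, 300, 20)  # interpreted as (seq_len, batch, features) since batch_first=False
out_1, h_1 = gru_1(x)
print(out_1.shape)  # torch.Size([350, 300, 235]) - full sequence output: (seq_len, batch, hidden)
print(h_1.shape)    # torch.Size([2, 300, 235])   - final hidden state: (num_layers, batch, hidden)
out_2, h_2 = gru_2(torch.tanh(h_1))  # h_1's layer dim (2) is now treated as seq_len
print(out_2.shape)  # torch.Size([2, 300, 135])
print(h_2.shape)    # torch.Size([3, 300, 135])   - (num_layers, batch, hidden)


So the layer dimension of the first GRU's hidden state ends up where I would expect the batch dimension to be.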


class CriticNN(nn.Module):
    def __init__(self, stock_env: StockEnv):
        # stock_env.window_size = 300
        super(CriticNN, self).__init__()
        self.stock_env = stock_env
        # the whole flattened observation window (window_size * features) goes through plain linear layers
        self.l1 = nn.Linear(stock_env.window_size * 25, 128)
        self.l2 = nn.Linear(128, 256)
        self.l3 = nn.Linear(256, 1)
        self.activation = nn.ReLU()

    def forward(self, x):
        output = self.activation(self.l1(torch.flatten(x, start_dim=1)))
        output = self.activation(self.l2(output))
        output = self.l3(output)
        return output

Now, my problem shows up in the agent's optimize function:


def optimize(self):
    if len(self.memory) < self.config.batch_size:
        return
    
    state, action, new_state, reward, done = self.memory.sample(batch_size=self.config.batch_size)

    state = torch.Tensor(np.array(state)).to(device)
    new_state = torch.Tensor(np.array(new_state)).to(device)
    reward = torch.Tensor(reward).to(device)
    action = torch.LongTensor(action).to(device)
    done = torch.Tensor(done).to(device)
    # the actor output is passed to Categorical directly as (positional) probabilities
    dist = torch.distributions.Categorical(self.actor(state))
    # one-step TD advantage: r + gamma * (1 - done) * V(s') - V(s)
    advantage = reward + (1 - done) * self.config.gamma * self.critic(new_state).squeeze(1) - self.critic(state).squeeze(1)

    critic_loss = advantage.pow(2).mean()
    self.optimizer_critic.zero_grad()
    critic_loss.backward()
    self.optimizer_critic.step()

    # policy-gradient loss; advantage is detached so the actor step does not backprop into the critic
    actor_loss = -dist.log_prob(action) * advantage.detach()
    self.optimizer_actor.zero_grad()
    actor_loss.mean().backward()
    self.optimizer_actor.step()

When I initialize my dist variable, it ends up with shape [3 (hidden layers), 300 (window_size), 2 (nr of actions)], while my action has shape [350 (batch_size)].

Now, when I try to run dist.log_prob(action), I get the following error message:

The size of tensor a (350) must match the size of tensor b (300) at non-singleton dimension 1
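
I can reproduce the mismatch in isolation; as far as I understand, Categorical treats the leading dimensions of its input as batch dimensions, so log_prob tries to broadcast my action against them:


import torch

probs = torch.softmax(torch.randn(3, 300, 2), dim=-1)  # same shape as my actor output
dist = torch.distributions.Categorical(probs)          # batch_shape = [3, 300]
action = torch.randint(0, 2, (350,))                   # my sampled batch of actions
dist.log_prob(action)  # RuntimeError: size of tensor a (350) vs tensor b (300) at dimension 1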

This happens because my dist does not have the same shape as my action, and here is my question: how can I make them match? Can any of you help me? I tried using multiple linear layers to match their sizes, but I could not make it work.
