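I am trying to build an algorithm that maximizes the terminal wealth of a portfolio. I am using the REINFORCE with baseline algorithm from Sutton and Barto (2018). I have a neural network for the policy that takes the current wealth and the time remaining in the investment horizon as inputs and outputs two values: the mean and the standard deviation of a normal distribution. The discounted dollar amount invested in the risky asset is then sampled from that distribution. I have a second network for the value function (same inputs, but it outputs the state value).

I have solved the problem analytically, and my value network converges nicely to the optimal solution. My policy network does not, which leads me to believe I could improve the network architecture to "help" it find the optimal solution. I am fairly new to PyTorch and neural networks, so I would appreciate any ideas on how to do this. My policy network is below; it has two hidden layers with 32 nodes each. I have also experimented with the learning rate, and that did not seem to help much. Thanks!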

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


class PolicyNetwork(nn.Module):
    ''' Neural network for the policy, which is taken to be normally distributed,
    hence this network returns a mean and a standard deviation '''
    def __init__(self, lr, input_dims, fc1_dims, fc2_dims, n_returns):
        super().__init__()
        self.input_dims = input_dims
        self.fc1_dims = fc1_dims
        self.fc2_dims = fc2_dims
        self.n_returns = n_returns
        self.lr = lr
        # inputs should be wealth and time to maturity
        self.fc1 = nn.Linear(*self.input_dims, self.fc1_dims)
        self.fc2 = nn.Linear(self.fc1_dims, self.fc2_dims)
        # returns mean and sd of the normal distribution
        self.fc3 = nn.Linear(self.fc2_dims, n_returns)
        self.optimizer = optim.Adam(self.parameters(), lr=lr)

    def forward(self, observation):
        # convert the observation to a float tensor and add a batch dimension
        state = torch.as_tensor(observation, dtype=torch.float32).unsqueeze(0)
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        first_slice = x[:, 0]
        second_slice = x[:, 1]
        tuple_of_activated_parts = (
            first_slice,  # let the mean be negative
            # F.relu(first_slice),  # make sure the mean is positive
            # torch.sigmoid(second_slice)  # makes sd positive but caps it below 1
            F.softplus(second_slice)  # make sd positive without capping it at 1
        )
        # for a single observation this is a tensor of shape (2,): [mean, sd]
        out = torch.cat(tuple_of_activated_parts, dim=-1)
        return out
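
For context, here is a minimal sketch of how the two outputs of the policy network would feed into a single REINFORCE with baseline step. It is an illustration under assumptions, not the exact training code: the function name `reinforce_baseline_losses` and its argument names are hypothetical, the discount factor is taken to be 1 (so the gamma^t factor from Sutton and Barto is omitted), and `G` stands for the return observed from the current time step to the end of the horizon.

```python
import torch
from torch.distributions import Normal


def reinforce_baseline_losses(mean, sd, action, G, value):
    """Per-step losses for REINFORCE with baseline (hypothetical helper).

    mean, sd : outputs of the policy network for the current state
    action   : the amount that was actually sampled from Normal(mean, sd)
    G        : return observed from this step onward
    value    : output of the value network for the current state
    """
    dist = Normal(mean, sd)
    log_prob = dist.log_prob(action)

    # The baseline is detached so the policy loss does not
    # push gradients into the value network.
    delta = G - value.detach()

    # Minimizing -delta * log pi(a|s) ascends the policy gradient.
    policy_loss = -(delta * log_prob)

    # Squared error drives the value estimate toward the observed return.
    value_loss = 0.5 * (G - value) ** 2
    return policy_loss, value_loss


# Illustrative numbers only; in practice these come from the two networks.
mean = torch.tensor(0.0, requires_grad=True)
sd = torch.tensor(1.0)
value = torch.tensor(1.0, requires_grad=True)
action = Normal(mean, sd).sample()  # sample() does not track gradients
policy_loss, value_loss = reinforce_baseline_losses(
    mean, sd, action, torch.tensor(2.0), value
)
```

Calling `backward()` on each loss and stepping the corresponding optimizer would then complete one update. Note that the action is drawn with `sample()` rather than `rsample()`, since REINFORCE differentiates the log-probability and not the sample itself.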