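I am trying to build an algorithm that maximizes the terminal wealth of a portfolio. I am using the REINFORCE with baseline algorithm from Sutton and Barto (2018).

I have a neural network for the policy. It takes the current wealth and the time remaining in the investment horizon as inputs and outputs two values: the mean and the standard deviation of a normal distribution. The discounted dollar amount invested in the risky asset is then sampled from that distribution. I have a second network for the value function (same inputs, but it outputs the state value).

I have solved this problem analytically, and my value network converges nicely to the optimal solution. My policy network does not, which makes me think I could change the network architecture to "help" it find the optimal solution. I am fairly new to PyTorch and neural networks, so I would appreciate any ideas on how to do this. My policy network is below; it has two hidden layers with 32 nodes each. I have also tried different learning rates, which did not seem to help much. Thanks!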

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


class PolicyNetwork(nn.Module):
    ''' Neural network for the policy, which is taken to be normally distributed, hence
    this network returns a mean and a standard deviation '''
    def __init__(self, lr, input_dims, fc1_dims, fc2_dims, n_returns):
        super(PolicyNetwork, self).__init__()
        self.input_dims = input_dims  # sequence such as (2,); unpacked below
        self.fc1_dims = fc1_dims
        self.fc2_dims = fc2_dims
        self.n_returns = n_returns
        self.lr = lr
        self.fc1 = nn.Linear(*self.input_dims, self.fc1_dims)  # inputs should be wealth and time to maturity
        self.fc2 = nn.Linear(self.fc1_dims, self.fc2_dims)
        self.fc3 = nn.Linear(self.fc2_dims, n_returns)  # returns mean and sd of normal dist
        self.optimizer = optim.Adam(self.parameters(), lr=lr)

    def forward(self, observation):
        # add a batch dimension: shape (input_dims,) -> (1, input_dims)
        state = torch.as_tensor(observation, dtype=torch.float32).unsqueeze(0)
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        first_slice = x[:, 0]   # raw mean, shape (1,)
        second_slice = x[:, 1]  # raw sd before activation, shape (1,)
        tuple_of_activated_parts = (
            first_slice,  # let mean be negative
            # F.relu(first_slice),  # make sure mean is positive
            # torch.sigmoid(second_slice)  # make sure sd is positive
            F.softplus(second_slice)  # make sd positive but don't trap it below 1
        )
        out = torch.cat(tuple_of_activated_parts, dim=-1)  # shape (2,): [mean, sd]
        return out
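
For context, here is a minimal sketch of how a `[mean, sd]` output like the one above could feed into a single REINFORCE with baseline step. It is not the training loop from the post. The function names `sample_action` and `reinforce_baseline_loss`, and the arguments `G` (the observed return) and `baseline_value` (the value network's estimate of the state value), are hypothetical names chosen for illustration. The sketch also assumes the discount factor that Sutton and Barto attach to each step is omitted.

```python
import torch
from torch.distributions import Normal


def sample_action(policy_out):
    """Draw one action from N(mean, sd), given a tensor [mean, sd]."""
    mean, sd = policy_out[0], policy_out[1]
    # .sample() does not track gradients; the gradient enters through log_prob
    return Normal(mean, sd).sample()


def reinforce_baseline_loss(policy_out, action, G, baseline_value):
    """Policy loss for one time step (hypothetical helper, not from the post).

    policy_out:     tensor [mean, sd] with sd > 0
    action:         the action that was actually taken
    G:              observed return from this step onward
    baseline_value: value estimate for the state, as a tensor
    """
    mean, sd = policy_out[0], policy_out[1]
    dist = Normal(mean, sd)
    # delta = G - V(s); detached so this loss does not update the value network
    delta = G - baseline_value.detach()
    # minimizing -delta * log pi(a|s) is gradient ascent on the policy objective
    return -delta * dist.log_prob(action)


# Usage with made-up numbers
policy_out = torch.tensor([0.5, 1.0], requires_grad=True)
action = sample_action(policy_out)
loss = reinforce_baseline_loss(policy_out, action, G=2.0,
                               baseline_value=torch.tensor(1.0))
loss.backward()  # gradients now sit in policy_out.grad
```

Detaching the baseline keeps the policy update from leaking into the value network, which would normally be trained with its own loss and optimizer.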