reinforcement-learning - Can the output of a DDPG policy network be a probability distribution instead of a single action value?

Question

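We know that DDPG is a deterministic policy gradient method, so the output of its policy network should be a single action. However, I once tried making the policy network output a probability distribution over several actions: the output has length greater than 1, each action has its own probability, and the probabilities sum to 1. The output looks just like the one in a stochastic policy gradient method, but the gradient is computed and the network is updated the DDPG way. In the end the results looked quite good, but I don't understand why it works, since the output form does not strictly match what DDPG requires.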

score 0 · Accepted Answer

It works if you also include the gradient with respect to the distribution; otherwise it works only by chance.

If you do something like

logits = nn(s)
probs = softmax(logits)
then backprop through the softmax and back to nn

then this is the regular stochastic policy gradient with a softmax distribution, which was very common before deterministic policy gradients were introduced (and is still used sometimes).
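To make "backprop through softmax and back to nn" concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions, not the asker's actual setup: the "network" is a single linear layer `W`, and the critic is a hypothetical one that is linear in the action probabilities, `Q(s, probs) = sum_a probs[a] * q_values[a]`. The names `softmax_backward`, `actor_step`, and `q_values` are made up for this example.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_backward(probs, grad_probs):
    # Chain rule through softmax:
    # dQ/dlogit_j = p_j * (g_j - sum_i p_i * g_i), where g = dQ/dprobs.
    dot = sum(p * g for p, g in zip(probs, grad_probs))
    return [p * (g - dot) for p, g in zip(probs, grad_probs)]

def actor_step(W, s, q_values, lr=0.1):
    # Forward pass: linear "network" (assumption), then softmax.
    logits = [sum(w * x for w, x in zip(row, s)) for row in W]
    probs = softmax(logits)
    # Hypothetical critic Q(s, probs) = sum_a probs[a] * q_values[a],
    # so dQ/dprobs is simply q_values.
    grad_logits = softmax_backward(probs, q_values)
    # Gradient ascent on Q: dQ/dW[j][k] = dQ/dlogit_j * s[k].
    new_W = [[w + lr * g * x for w, x in zip(row, s)]
             for row, g in zip(W, grad_logits)]
    return new_W, probs
```

With this particular linear critic, the gradient on the logits, `p_j * (q_j - sum_i p_i * q_i)`, coincides with the expected softmax policy gradient, which is one way to read the claim above. If the gradient stops at `probs` and never passes through the softmax, that correspondence is lost.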
