It looks to me like you're correct: $e$ should be updated before $\theta$. That's also what should happen according to the math in the paper. See, for example, Equations (7) and (8), where $e_t$ is first computed using $\phi(s_t)$, and only *then* is $\theta$ updated using $\delta V_t$ (which would be $\delta Q$ in the control case).
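
Here is a minimal sketch of that ordering as I read it, written as one episode of linear Sarsa($\lambda$) with accumulating traces. To be clear, `env`, `phi`, `pick_action`, and the hyperparameters are illustrative placeholders of my own, not anything taken from the paper:

```python
import numpy as np

def sarsa_lambda_episode(env, phi, pick_action, theta,
                         alpha=0.01, gamma=1.0, lam=0.9):
    """One episode of linear Sarsa(lambda) with accumulating traces.

    The point being illustrated: the trace e is updated with the
    current features BEFORE theta is updated with the TD error.
    """
    e = np.zeros_like(theta)
    s = env.reset()
    a = pick_action(s, theta)
    done = False
    while not done:
        e = gamma * lam * e + phi(s, a)    # (1) update the trace first ...
        s_next, r, done = env.step(a)
        a_next = None if done else pick_action(s_next, theta)
        q = theta @ phi(s, a)
        q_next = 0.0 if done else theta @ phi(s_next, a_next)
        delta = r + gamma * q_next - q     # TD error (delta Q in control)
        theta = theta + alpha * delta * e  # (2) ... THEN update theta
        s, a = s_next, a_next
    return theta
```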
Note that what you wrote about the extreme case with $\lambda = 0$ is not entirely correct. The initial state-action pair will still be involved in an update: not in the first iteration, but its features will be incorporated in $e$ during the second iteration. However, it looks to me like the very first reward $r$ will never be used in any updates, because it only appears in the very first iteration, where $e$ is still $0$. Since this paper is about Go, I suspect that won't matter though; unless they're doing something unconventional, they probably only use non-zero rewards for the terminal game state.
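
For what it's worth, here is a tiny numeric walkthrough of that $\lambda = 0$ case, assuming the ordering your question describes (i.e. $\theta$ updated before $e$). The feature vector and TD errors are made-up values, just to show which quantities end up multiplying which:

```python
import numpy as np

# lam = 0, and theta is updated BEFORE e (the ordering in question).
theta, e = np.zeros(2), np.zeros(2)
alpha, lam = 0.1, 0.0

# Iteration 1: delta_1 contains the very first reward r.
phi_0 = np.array([1.0, 0.0])   # features of the initial state-action pair
delta_1 = 1.0
theta += alpha * delta_1 * e   # e is still 0, so r has no effect at all
e = lam * e + phi_0            # phi_0 only enters the trace afterwards

# Iteration 2: phi_0 is still in e, so the initial pair DOES get an update,
# just one iteration late (paired with the second TD error).
delta_2 = 0.5
theta += alpha * delta_2 * e
print(theta)                   # [0.05 0.  ]
```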