
I am reading Silver et al. (2012), "Temporal-Difference Search in Computer Go", and trying to understand the update order in the eligibility-trace algorithms. In Algorithms 1 and 2 of the paper, the weights are updated before the eligibility trace is updated, and I wonder whether this order is correct (lines 11 and 12 of Algorithm 1, and lines 12 and 13 of Algorithm 2). Consider the extreme case lambda = 0: the parameters would never be updated with the initial state-action pair (because e is still 0). So I suspect the order should be the other way around.

Can someone clarify this?

I find this paper very instructive for learning reinforcement learning, so I would like to understand it in detail.

If there is a more suitable venue for this question, please let me know as well.



1 Answer


It looks to me like you're correct, e should be updated before theta. That's also what should happen according to the math in the paper. See, for example, Equations (7) and (8), where e_t is first computed using phi(s_t), and only THEN is theta updated using delta V_t (which would be delta Q in the control case).

Note that what you wrote about the extreme case with lambda=0 is not entirely correct. The initial state-action pair will still be involved in an update (not in the first iteration, but it will be incorporated into e during the second iteration). However, it looks to me like the very first reward r will never be used in any update (because it only appears in the very first iteration, where e is still 0). Since this paper is about Go, I suspect that will not matter in practice; unless they are doing something unconventional, they probably only use non-zero rewards for the terminal game state.
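To make the ordering concrete, here is a minimal, hypothetical sketch of one episode of linear TD(lambda) with accumulating traces, written with the order from Equations (7)-(8): the trace e is updated with phi(s_t) before theta is updated with the TD error. The function name and the toy two-feature transitions are my own illustration, not from the paper.

```python
import numpy as np

def td_lambda_episode(transitions, theta, alpha=0.1, lam=0.9, gamma=1.0):
    """One episode of linear TD(lambda) with accumulating traces.

    transitions: list of (phi_s, reward, phi_s_next) feature vectors.
    The trace e is updated BEFORE theta, matching Eqs. (7)-(8).
    """
    e = np.zeros_like(theta)
    for phi_s, r, phi_next in transitions:
        # TD error: r + gamma * V(s') - V(s), with V(s) = phi(s) . theta
        delta = r + gamma * (phi_next @ theta) - (phi_s @ theta)
        # Trace first: decay and add phi(s_t), so the current state
        # participates in this step's weight update even when lam = 0.
        e = gamma * lam * e + phi_s
        # Then the weight update, which uses the freshly updated trace.
        theta = theta + alpha * delta * e
    return theta

# Toy two-state episode with lam = 0: even the first transition's state
# features enter the update, which would not happen with the reversed order.
theta = td_lambda_episode(
    [(np.array([1.0, 0.0]), 0.0, np.array([0.0, 1.0])),
     (np.array([0.0, 1.0]), 1.0, np.array([0.0, 0.0]))],
    theta=np.zeros(2), alpha=0.5, lam=0.0)
```

With the reversed order (theta before e), setting lam = 0 would make each weight update use the previous state's features, which is the inconsistency the question points out.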

answered 2018-10-18T17:28:20.063