It looks to me like you're correct: $e$ should be updated before $\theta$. That's also what should happen according to the math in the paper. See, for example, Equations (7) and (8), where $e_t$ is first computed using $\phi(s_t)$, and only *then* is $\theta$ updated using $\delta V_t$ (which would be $\delta Q$ in the control case).
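
Here is a minimal sketch of that ordering as I read it, written as one episode of linear Sarsa($\lambda$) with accumulating traces. To be clear, `env`, `phi`, `pick_action`, and the hyperparameters are illustrative placeholders of my own, not anything taken from the paper:

```python
import numpy as np

def sarsa_lambda_episode(env, phi, pick_action, theta,
                         alpha=0.01, gamma=1.0, lam=0.9):
    """One episode of linear Sarsa(lambda) with accumulating traces.

    The point being illustrated: the trace e is updated with the
    current features BEFORE theta is updated with the TD error.
    """
    e = np.zeros_like(theta)
    s = env.reset()
    a = pick_action(s, theta)
    done = False
    while not done:
        e = gamma * lam * e + phi(s, a)    # (1) update the trace first ...
        s_next, r, done = env.step(a)
        a_next = None if done else pick_action(s_next, theta)
        q = theta @ phi(s, a)
        q_next = 0.0 if done else theta @ phi(s_next, a_next)
        delta = r + gamma * q_next - q     # TD error (delta Q in control)
        theta = theta + alpha * delta * e  # (2) ... THEN update theta
        s, a = s_next, a_next
    return theta
```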
Note that what you wrote about the extreme case with $\lambda = 0$ is not entirely correct. The initial state-action pair will still be involved in an update: not in the first iteration, but its features will be incorporated in $e$ during the second iteration. However, it looks to me like the very first reward $r$ will never be used in any updates, because it only appears in the very first iteration, where $e$ is still $0$. Since this paper is about Go, I suspect that won't matter though; unless they're doing something unconventional, they probably only use non-zero rewards for the terminal game state.
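
For what it's worth, here is a tiny numeric walkthrough of that $\lambda = 0$ case, assuming the ordering your question describes (i.e. $\theta$ updated before $e$). The feature vector and TD errors are made-up values, just to show which quantities end up multiplying which:

```python
import numpy as np

# lam = 0, and theta is updated BEFORE e (the ordering in question).
theta, e = np.zeros(2), np.zeros(2)
alpha, lam = 0.1, 0.0

# Iteration 1: delta_1 contains the very first reward r.
phi_0 = np.array([1.0, 0.0])   # features of the initial state-action pair
delta_1 = 1.0
theta += alpha * delta_1 * e   # e is still 0, so r has no effect at all
e = lam * e + phi_0            # phi_0 only enters the trace afterwards

# Iteration 2: phi_0 is still in e, so the initial pair DOES get an update,
# just one iteration late (paired with the second TD error).
delta_2 = 0.5
theta += alpha * delta_2 * e
print(theta)                   # [0.05 0.  ]
```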