python - Double counting in temporal difference learning

Question

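I am working through a temporal difference learning example (https://www.youtube.com/watch?v=XrxgdpduWOU), and I am having trouble with the following equation in my python implementation, because I seem to be double counting the reward and Q.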

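If I encode the grid below as a 2D array, my current position is (2, 2) and the goal is (2, 3), assuming the max reward is 1. Let Q(t) be the average value of my current position; then r(t+1) is 1, and I assume max Q(t+1) is also 1, which results in my Q(t) getting close to 2 (assuming gamma is 1). Is this correct, or should I assume that Q(n), where n is the end point, is 0?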

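Edited to include code - I modified the get_max_q function to return 0 if it is the end point, and the values are now all below 1 (which I think is more correct, since the reward is just 1), but I am not sure if this is the right approach (previously I set it to return 1 at the end point).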

import random
import numpy as np

#not sure if this is correct
def get_max_q(q, pos):
    #end point 
    #not sure if I should set this to 0 or 1
    if pos == (MAX_ROWS - 1, MAX_COLS - 1):
        return 0
    return max([q[pos, am] for am in available_moves(pos)])

def learn(q, old_pos, action, reward):
    new_pos = get_new_pos(old_pos, action)
    max_q_next_move = get_max_q(q, new_pos) 

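    # TD update; gamma is implicitly 1 here, and the 0.04 step cost is
    # subtracted after the update rather than folded into the reward
    q[(old_pos, action)] = q[old_pos, action] + alpha * (reward + max_q_next_move - q[old_pos, action]) - 0.04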

def move(q, curr_pos):
    moves = available_moves(curr_pos)
    if random.random() < epsilon:
        action = random.choice(moves)
    else:
        # look up by (position, move); q is keyed on that pair everywhere else
        index = np.argmax([q[curr_pos, m] for m in moves])
        action = moves[index]

    new_pos = get_new_pos(curr_pos, action)

    #end point
    if new_pos == (MAX_ROWS - 1, MAX_COLS - 1):
        reward = 1
    else:
        reward = 0

    learn(q, curr_pos, action, reward)
    return new_pos

=======================
OUTPUT
Average value (after I set Q(end point) to 0)
defaultdict(float,
            {((0, 0), 'DOWN'): 0.5999999999999996,
             ((0, 0), 'RIGHT'): 0.5999999999999996,
              ...
             ((2, 2), 'UP'): 0.7599999999999998})

Average value (after I set Q(end point) to 1)
defaultdict(float,
        {((0, 0), 'DOWN'): 1.5999999999999996,
         ((0, 0), 'RIGHT'): 1.5999999999999996,
         ....
         ((2, 2), 'LEFT'): 1.7599999999999998,
         ((2, 2), 'RIGHT'): 1.92,
         ((2, 2), 'UP'): 1.7599999999999998})
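
The two outputs differ by exactly 1 in every entry, which is the terminal value being added on top of the reward. As a rough check of the printed numbers, the sketch below computes the value at which the update in learn stops changing. It assumes alpha = 0.5, which is not shown in the post and is only inferred from the printed values, and a 5-move shortest path from (0, 0) to the goal; fixed_point, iterate and chain are hypothetical helper names.

```python
def fixed_point(target, alpha=0.5, step_cost=0.04):
    """Value q at which q + alpha * (target - q) - step_cost equals q."""
    return target - step_cost / alpha


def iterate(target, alpha=0.5, step_cost=0.04, sweeps=200):
    """Apply the update from learn repeatedly to one entry, starting at 0."""
    q = 0.0
    for _ in range(sweeps):
        q = q + alpha * (target - q) - step_cost
    return q


def chain(terminal_value, steps=5):
    """Best-action values walking backward from the goal along a shortest path."""
    values = []
    reward, next_value = 1, terminal_value
    for _ in range(steps):
        q = fixed_point(reward + next_value)
        values.append(q)
        reward, next_value = 0, q  # earlier moves earn no reward themselves
    return values


print([round(v, 2) for v in chain(1)])  # [1.92, 1.84, 1.76, 1.68, 1.6]
print([round(v, 2) for v in chain(0)])  # [0.92, 0.84, 0.76, 0.68, 0.6]
```

Under these assumptions the first list matches the 1.92, 1.76 and 1.6 entries printed for the terminal value of 1, and the second matches the 0.76 and 0.6 entries printed for the terminal value of 0.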

score 1 · Accepted Answer

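Q values represent an estimate of how much reward you expect to collect until the end of the episode. So, in the terminal state, maxQ = 0, because you will not get any more reward after that. Hence the Q value at t will be 1, which is correct for your undiscounted problem. But you should not leave gamma out of the equation; add it to your formula to make the problem discounted. For example, with gamma = 0.9 the Q value at t is still 1, because the reward arrives on the final move and gamma only multiplies maxQ of the next state, which is 0. One step further back, at (2,1) and (1,2), the best action has a Q value of 0.9, two steps back it is 0.81, and so on.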
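
A minimal sketch of the update with gamma included and the terminal value fixed at 0. It assumes a 3 x 4 grid with the goal at the bottom-right corner, alpha = 0.5 and gamma = 0.9, none of which are stated in the post, and it leaves out the 0.04 step cost. available_moves and get_new_pos are stand-ins for the helpers the question does not show. To keep the check deterministic it sweeps over every state-action pair instead of running epsilon-greedy episodes.

```python
from collections import defaultdict

ROWS, COLS = 3, 4                 # assumed grid size
GOAL = (ROWS - 1, COLS - 1)       # terminal state, as in the question's code
ALPHA, GAMMA = 0.5, 0.9           # assumed learning rate and discount
MOVES = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}


def available_moves(pos):
    """Moves that keep the agent on the grid (stand-in helper)."""
    return [m for m, (dr, dc) in MOVES.items()
            if 0 <= pos[0] + dr < ROWS and 0 <= pos[1] + dc < COLS]


def get_new_pos(pos, action):
    """Position reached by taking action from pos (stand-in helper)."""
    dr, dc = MOVES[action]
    return (pos[0] + dr, pos[1] + dc)


def get_max_q(q, pos):
    """Best value from pos; 0 at the goal, since no reward follows it."""
    if pos == GOAL:
        return 0.0
    return max(q[pos, m] for m in available_moves(pos))


def learn(q, old_pos, action, reward):
    """One TD update with the discount applied to the next state's value."""
    new_pos = get_new_pos(old_pos, action)
    target = reward + GAMMA * get_max_q(q, new_pos)
    q[old_pos, action] += ALPHA * (target - q[old_pos, action])


q = defaultdict(float)
for _ in range(200):              # repeated sweeps over all state-action pairs
    for row in range(ROWS):
        for col in range(COLS):
            pos = (row, col)
            if pos == GOAL:
                continue
            for action in available_moves(pos):
                reward = 1 if get_new_pos(pos, action) == GOAL else 0
                learn(q, pos, action, reward)

print(round(q[(2, 2), "RIGHT"], 4))  # 1.0
print(round(q[(2, 1), "RIGHT"], 4))  # 0.9
print(round(q[(0, 0), "RIGHT"], 4))  # 0.6561
```

With the reward paid on the move into the goal, as in the question's code, the final move is worth the reward itself and each move further away picks up one more factor of gamma. No entry exceeds the maximum reward of 1, so nothing is counted twice.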
