python - Why doesn't my Deep Q Network master a simple Gridworld (TensorFlow)? (How to evaluate a Deep-Q-Net)

Question

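I am trying to get familiar with Q-learning and deep neural networks, and I am currently trying to implement Playing Atari with Deep Reinforcement Learning.

To test my implementation and play around with it, I thought I would try a simple gridworld. I have an N x N grid, start in the top-left corner, and finish in the bottom-right corner. The possible actions are: left, up, right, down.

Even though my implementation has become very similar to this one (hopefully it's a good one), it doesn't seem to learn anything. Looking at the total number of steps it needs to finish (I guess the average is around 500 for a 10x10 grid, but there are also very low and very high values), it looks more random than anything else to me.

I tried it with and without convolutional layers and played around with all the parameters, but to be honest I have no idea whether something is wrong with my implementation, whether it just needs to train longer (I let it train for quite a while), or whether it is something else. At least it seems to converge; here is a plot of the loss value for one training session:

So what is the problem in this case?

But also, and maybe more importantly, how can I "debug" these Deep-Q-Nets? In supervised training there are training, test, and validation sets, and they can be evaluated with, for example, precision and recall. What options do I have for unsupervised learning with Deep-Q-Nets, so that next time I can maybe fix it myself?

Finally, here is the code:

This is the network: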

ACTIONS = 5

# Inputs
x = tf.placeholder('float', shape=[None, 10, 10, 4])
y = tf.placeholder('float', shape=[None])
a = tf.placeholder('float', shape=[None, ACTIONS])

# Layer 1 Conv1 - input
with tf.name_scope('Layer1'):
    W_conv1 = weight_variable([8,8,4,8])
    b_conv1 = bias_variable([8])    
    h_conv1 = tf.nn.relu(conv2d(x, W_conv1, 5)+b_conv1)

# Layer 2 Conv2 - hidden1 
with tf.name_scope('Layer2'):
    W_conv2 = weight_variable([2,2,8,8])
    b_conv2 = bias_variable([8])
    h_conv2 = tf.nn.relu(conv2d(h_conv1, W_conv2, 1)+b_conv2)
    h_conv2_max_pool = max_pool_2x2(h_conv2)

# Layer 3 fc1 - hidden 2
with tf.name_scope('Layer3'):
    W_fc1 = weight_variable([8, 32])
    b_fc1 = bias_variable([32])
    h_conv2_flat = tf.reshape(h_conv2_max_pool, [-1, 8])
    h_fc1 = tf.nn.relu(tf.matmul(h_conv2_flat, W_fc1)+b_fc1)

# Layer 4 fc2 - readout
with tf.name_scope('Layer4'):
    W_fc2 = weight_variable([32, ACTIONS])
    b_fc2 = bias_variable([ACTIONS])
    readout = tf.matmul(h_fc1, W_fc2)+ b_fc2

# Training
with tf.name_scope('training'):
    readout_action = tf.reduce_sum(tf.mul(readout, a), reduction_indices=1)
    loss = tf.reduce_mean(tf.square(y - readout_action))
    train = tf.train.AdamOptimizer(1e-6).minimize(loss)

    loss_summ = tf.scalar_summary('loss', loss)

And here is the training:

# 0 => left
# 1 => up
# 2 => right
# 3 => down
# 4 => noop

ACTIONS = 5
GAMMA = 0.95
BATCH = 50
TRANSITIONS = 2000
OBSERVATIONS = 1000
MAXSTEPS = 1000

D = deque()
epsilon = 1

average = 0
for episode in xrange(1000):
    step_count = 0
    game_ended = False

    state = np.array([0.0]*100, float).reshape(100)
    state[0] = 1

    rsh_state = state.reshape(10,10)
    s = np.stack((rsh_state, rsh_state, rsh_state, rsh_state), axis=2)

    while step_count < MAXSTEPS and not game_ended:
        reward = 0
        step_count += 1

        read = readout.eval(feed_dict={x: [s]})[0]

        act = np.zeros(ACTIONS)
        action = random.randint(0,4)
        if len(D) > OBSERVATIONS and random.random() > epsilon:
            action = np.argmax(read)
        act[action] = 1

        # play the game
        pos_idx = state.argmax(axis=0)
        pos = pos_idx + 1

        state[pos_idx] = 0
        if action == 0 and pos%10 != 1: #left
            state[pos_idx-1] = 1
        elif action == 1 and pos > 10: #up
            state[pos_idx-10] = 1
        elif action == 2 and pos%10 != 0: #right
            state[pos_idx+1] = 1
        elif action == 3 and pos < 91: #down
            state[pos_idx+10] = 1
        else: #noop
            state[pos_idx] = 1
            pass

        if state.argmax(axis=0) == pos_idx and reward > 0:
            reward -= 0.0001

        if step_count == MAXSTEPS:
            reward -= 100
        elif state[99] == 1: # reward & finished
            reward += 100
            game_ended = True
        else:
            reward -= 1


        s_old = np.copy(s)
        s = np.append(s[:,:,1:], state.reshape(10,10,1), axis=2)

        D.append((s_old, act, reward, s))
        if len(D) > TRANSITIONS:
            D.popleft()

        if len(D) > OBSERVATIONS:
            minibatch = random.sample(D, BATCH)

            s_j_batch = [d[0] for d in minibatch]
            a_batch = [d[1] for d in minibatch]
            r_batch = [d[2] for d in minibatch]
            s_j1_batch = [d[3] for d in minibatch]

            readout_j1_batch = readout.eval(feed_dict={x:s_j1_batch})
            y_batch = []

            for i in xrange(0, len(minibatch)):
                y_batch.append(r_batch[i] + GAMMA * np.max(readout_j1_batch[i]))

            train.run(feed_dict={x: s_j_batch, y: y_batch, a: a_batch})

        if epsilon > 0.05:
            epsilon -= 0.01

Thanks for any help and ideas!

score 5 · Accepted Answer

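For those who are interested: I adjusted the parameters and the model further, but the biggest improvement came from switching to a simple feed-forward network with 3 layers and about 50 neurons in the hidden layers. For me it converged in a reasonable amount of time.

By the way, further tips for debugging are appreciated!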
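On the debugging point, two checks that are often used with DQN are (a) running the current policy with exploration switched off and recording its return and step count, and (b) tracking the average maximum Q-value over a fixed, held-out set of states. The following is only a minimal sketch of both. The names `q_fn` (returns a list of Q-values for a state) and `env_step` (returns `(next_state, reward, done)`) are hypothetical and do not come from the question's code, and the 5-cell corridor is a made-up toy environment used purely to exercise the helpers.

```python
def evaluate_greedy(q_fn, env_step, start_state, max_steps=1000):
    """Run one episode without exploration; return (total_reward, steps)."""
    state, total, steps, done = start_state, 0.0, 0, False
    while not done and steps < max_steps:
        q = q_fn(state)
        # Always take the action with the highest predicted Q-value.
        action = max(range(len(q)), key=lambda a: q[a])
        state, reward, done = env_step(state, action)
        total += reward
        steps += 1
    return total, steps


def average_max_q(q_fn, held_out_states):
    """Mean of max_a Q(s, a) over a fixed set of states."""
    values = [max(q_fn(s)) for s in held_out_states]
    return sum(values) / len(values)


# Toy environment (hypothetical): action 1 moves right, action 0 stays put.
def corridor_step(state, action):
    nxt = min(state + action, 4)
    done = nxt == 4
    return nxt, (100.0 if done else -1.0), done


# Hand-written stand-in for a trained network that always prefers "right".
def always_right(state):
    return [0.0, 1.0]


total, steps = evaluate_greedy(always_right, corridor_step, start_state=0)
avg_q = average_max_q(always_right, [0, 1, 2, 3])
```

Logging both numbers every few hundred training steps tends to be more informative than the loss alone, because the regression targets move during training and a small loss does not by itself mean the policy is good.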

score 1 · Accepted Answer

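I wrote this question quite a while ago, but there still seems to be some interest in it and requests for running code, so I finally decided to create a GitHub repository.

Since I wrote it a long time ago, it won't run out of the box, but it shouldn't be hard to get it running. So here is the deep Q network and the example I wrote back then, I hope you enjoy it: link to the deep q repository

Contributions are welcome; if you fix it and get it running, please open a pull request!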

score 0 · Accepted Answer

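I have implemented a simple toy DQN without CNN layers, and it works. Here are some findings from my implementation; I hope they help.

According to DeepMind's paper, they did not use max pooling layers, because pooling makes the image representation position-invariant, which is bad for games: the agent's position is essential information for the game (see the DQN architecture).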
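To see how little spatial information the network in the question keeps, the layer output sizes can be worked out by hand. This is a sketch under assumptions: the helpers `conv2d` and `max_pool_2x2` are not shown in the question, so it is assumed here that they use 'SAME' padding and that the pool has stride 2. That assumption is at least consistent with the question's reshape to `[-1, 8]`.

```python
def same_padding_out(size, stride):
    """Output width of a conv/pool layer with 'SAME' padding: ceil(size / stride)."""
    return -(-size // stride)


grid = 10
conv1 = same_padding_out(grid, 5)    # Layer1: 8x8 kernel, stride 5
conv2 = same_padding_out(conv1, 1)   # Layer2: 2x2 kernel, stride 1
pool = same_padding_out(conv2, 2)    # 2x2 max pool, stride 2 (assumed)
features = pool * pool * 8           # 8 output channels

print(conv1, conv2, pool, features)  # prints: 2 2 1 8
```

Under these assumptions the whole 10x10 grid is reduced to 8 numbers before the first fully connected layer, which leaves little room to represent where the agent is.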
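If you want to skip the CNN at first and use a gym environment (as I did for my toy implementation), here are a few things I found during development:
- Encode your environment's state with one-hot encoding; this makes training more efficient.
- I use only a weight matrix of shape [number of states, number of actions], matrix-multiplied with the one-hot encoded input state. No bias and no activation function (I assume they would increase training time; it never worked once I added other layers or anything else).

These are the two things I found very important for my implementation. I don't fully understand the reasons behind them, but I hope my answer gives you some insight.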
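The two points above can be sketched on the 10x10 gridworld from the question. This is not the answerer's code, only an illustration under assumptions: it uses four actions (no noop), the question's step and goal rewards, and a learning rate, epsilon schedule, and episode count chosen arbitrarily. Multiplying a one-hot state vector by a `[states, actions]` weight matrix simply selects one row, so the "network" here is a plain list of rows and the update is ordinary Q-learning.

```python
import random

N = 10                       # grid side length, as in the question
N_STATES = N * N
N_ACTIONS = 4                # 0=left, 1=up, 2=right, 3=down
GOAL = N_STATES - 1


def step(state, action):
    """Move in the grid; bumping into a wall leaves the state unchanged."""
    row, col = divmod(state, N)
    if action == 0 and col > 0:
        col -= 1
    elif action == 1 and row > 0:
        row -= 1
    elif action == 2 and col < N - 1:
        col += 1
    elif action == 3 and row < N - 1:
        row += 1
    nxt = row * N + col
    done = nxt == GOAL
    return nxt, (100.0 if done else -1.0), done


def train(episodes=500, alpha=0.5, gamma=0.95, max_steps=1000, seed=0):
    rng = random.Random(seed)
    # W[s] equals one_hot(s) times W: no bias, no activation function.
    W = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    epsilon = 1.0
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            if rng.random() < epsilon:
                action = rng.randrange(N_ACTIONS)
            else:
                action = max(range(N_ACTIONS), key=lambda a: W[state][a])
            nxt, reward, done = step(state, action)
            # Terminal transitions do not bootstrap from the next state.
            target = reward if done else reward + gamma * max(W[nxt])
            W[state][action] += alpha * (target - W[state][action])
            state = nxt
            if done:
                break
        epsilon = max(0.05, epsilon * 0.99)   # decay once per episode
    return W


def greedy_steps(W, max_steps=1000):
    """Number of steps the greedy policy needs from the top-left corner."""
    state, steps, done = 0, 0, False
    while not done and steps < max_steps:
        action = max(range(N_ACTIONS), key=lambda a: W[state][a])
        state, _, done = step(state, action)
        steps += 1
    return steps


W = train()
print(greedy_steps(W))
```

If learning has converged, the greedy rollout takes 18 steps, the length of the shortest path on a 10x10 grid; a value stuck at `max_steps` means the greedy policy never reaches the goal.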
