
For reinforcement learning, the policy is typically computed by applying a forward pass of the neural network at every step of the episode. Afterwards, backpropagation can be used to compute the parameter gradients. A simplified implementation of my network looks like this:

import tensorflow as tf
import tensorflow.contrib.slim as slim

class AC_Network(object):

    def __init__(self, s_size, a_size, scope, trainer, parameters_net):
        with tf.variable_scope(scope):
            self.is_training = tf.placeholder(shape=[], dtype=tf.bool)
            self.inputs = tf.placeholder(shape=[None, s_size], dtype=tf.float32)
            # (...)
            # layer_size and policy_loss_multiplier are hyperparameters defined elsewhere (omitted)
            layer = slim.fully_connected(self.inputs,
                                         layer_size,
                                         activation_fn=tf.nn.relu,
                                         biases_initializer=None)
            layer = tf.contrib.layers.dropout(inputs=layer, keep_prob=parameters_net["dropout_keep_prob"],
                                              is_training=self.is_training)

            self.policy = slim.fully_connected(layer, a_size,
                                               activation_fn=tf.nn.softmax,
                                               biases_initializer=None)

            # placeholders for the actions taken and the advantages of the rollout
            self.actions = tf.placeholder(shape=[None], dtype=tf.int32)
            self.advantages = tf.placeholder(shape=[None], dtype=tf.float32)
            actions_onehot = tf.one_hot(self.actions, a_size, dtype=tf.float32)
            responsible_outputs = tf.reduce_sum(self.policy * actions_onehot, [1])
            self.policy_loss = - policy_loss_multiplier * tf.reduce_mean(tf.log(responsible_outputs) * self.advantages)

            local_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope)
            self.gradients = tf.gradients(self.policy_loss, local_vars)
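
For context on why I am asking: as far as I understand, tf.contrib.layers.dropout (which wraps tf.nn.dropout) samples a fresh random mask on every execution of the op, i.e. on every sess.run() in which is_training is True. A tiny standalone check (toy shapes, unrelated to the network above) shows this:

import numpy as np
import tensorflow as tf

x = tf.placeholder(shape=[None, 4], dtype=tf.float32)
is_training = tf.placeholder(shape=[], dtype=tf.bool)
dropped = tf.contrib.layers.dropout(inputs=x, keep_prob=0.5, is_training=is_training)

with tf.Session() as sess:
    feed = {x: np.ones((1, 4), dtype=np.float32), is_training: True}
    # two runs on the identical input generally show different masks,
    # e.g. [[2. 0. 2. 0.]] vs. [[0. 2. 2. 2.]] (kept units are scaled by 1/keep_prob)
    print(sess.run(dropped, feed_dict=feed))
    print(sess.run(dropped, feed_dict=feed))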

Now, during training, I first roll out the episode with consecutive forward passes (again, simplified):

s = self.local_env.reset()  # list of input variables for the first step
done = False
while not done:
    a_dist = sess.run(self.local_AC.policy,
                      feed_dict={self.local_AC.inputs: [s],
                                 self.local_AC.is_training: True})
    a = np.argmax(a_dist)  # pick the highest-probability action (simplified)
    s, r, done, extra_stat = self.local_env.step(a)
    # (...)

Finally, I compute the gradients with a backward pass:

p_l, grad = sess.run([self.local_AC.policy_loss,
                      self.local_AC.gradients],
                      feed_dict={self.local_AC.inputs: np.vstack(comb_observations),
                                 self.local_AC.is_training: True,
                                 self.local_AC.actions: np.hstack(comb_actions),
                                 # advantages must be fed as well for policy_loss; comb_advantages is
                                 # collected per episode like comb_actions (computation omitted)
                                 self.local_AC.advantages: np.hstack(comb_advantages)})
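
A quick way to see the effect on the actual network (just an illustration, not part of my training code): evaluate the policy twice on the same batch in training mode; since the dropout mask is resampled per call, the two results generally differ, which is why I suspect the gradient pass above sees yet another mask:

probs_1 = sess.run(self.local_AC.policy,
                   feed_dict={self.local_AC.inputs: np.vstack(comb_observations),
                              self.local_AC.is_training: True})
probs_2 = sess.run(self.local_AC.policy,
                   feed_dict={self.local_AC.inputs: np.vstack(comb_observations),
                              self.local_AC.is_training: True})
print(np.allclose(probs_1, probs_2))  # typically False while dropout is active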

(Note that I may have made a mistake somewhere above while trying to strip out as much of the original code as possible that is not relevant to the question.)

So the final question is: is there a way to ensure that all consecutive calls to sess.run() generate the same dropout structure? Ideally, I would like to have exactly the same dropout structure within each episode and only change it between episodes. Things seem to work fine, but I keep wondering.
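
To make concrete what I mean by "the same dropout structure within each episode", here is a sketch of the kind of behaviour I am after (dropout_mask and episode_mask are illustrative names, not from my code): replace the dropout layer with an explicitly fed mask and only resample that mask between episodes.

# inside AC_Network.__init__, instead of tf.contrib.layers.dropout:
self.dropout_mask = tf.placeholder(shape=[1, layer_size], dtype=tf.float32)
layer = layer * self.dropout_mask / parameters_net["dropout_keep_prob"]

# on the Python side, once at the start of each episode:
keep_prob = parameters_net["dropout_keep_prob"]
episode_mask = (np.random.rand(1, layer_size) < keep_prob).astype(np.float32)
# every sess.run() of this episode then gets
#     feed_dict={..., self.local_AC.dropout_mask: episode_mask}
# and episode_mask is only resampled when a new episode begins
# (for evaluation, feeding np.ones((1, layer_size)) * keep_prob makes the scaling cancel out)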

