
I am training a policy-gradient deep model using Keras. Following DeepRLHacks, I use entropy over the action space to monitor how the policy behaves. However, I have questions about my entropy implementation and about what I am observing.

My current policy is a simple FCNN, shown below:

import numpy as np
import tensorflow as tf
import tensorflow.keras.losses as kls
from tensorflow.keras.optimizers import Adam

# Single scalar state fed as shape (1, 1); action_size and self.learning_rate come from the enclosing class.
policy_input = tf.keras.layers.Input(shape=(1, 1))
intermediate = tf.keras.layers.Dense(8, activation="relu", use_bias=True, kernel_initializer=tf.keras.initializers.he_normal(), name="dense_01")(policy_input)
output = tf.keras.layers.Dense(action_size, activation="softmax", use_bias=True, name="dense_04")(intermediate)
policy = tf.keras.models.Model(inputs=policy_input, outputs=output)
opt = Adam(learning_rate=self.learning_rate)
policy.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])
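For reference, a minimal sketch of a forward pass with a dummy state (the output shape follows from the Input(shape=(1, 1)) layer above; a concrete action_size is assumed only for illustration):

# Assuming the model above was built with, e.g., action_size = 4 (hypothetical value).
dummy_state = np.zeros((1, 1, 1), dtype=np.float32)  # one state, matching Input(shape=(1, 1))
action_probs = policy.predict(dummy_state)           # shape (1, 1, action_size); softmax output sums to 1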

The policy provides the next possible action for the current state, as shown below:

def get_action(self, state, step):
    action_probs = policy.predict(state, batch_size=1).flatten()

    # Entropy can be calculated as the cross-entropy of the action distribution with itself.
    entropy = kls.categorical_crossentropy(action_probs, action_probs).numpy()

    # Send entropy to TensorBoard (assumes a default summary writer has been set elsewhere).
    tf.summary.scalar('entropy', entropy, step=step)

    # Renormalise to guard against floating-point drift, then sample an action.
    probs = action_probs / np.sum(action_probs)
    action = np.random.choice(self.action_size, 1, p=probs)[0]
    return action, probs
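As a sanity check on the entropy line above, here is a minimal sketch (with a hypothetical 4-action distribution, chosen only for illustration) showing that the cross-entropy of a distribution with itself equals its Shannon entropy, -sum(p * log p), in nats:

import numpy as np
import tensorflow as tf
import tensorflow.keras.losses as kls

# Hypothetical action distribution over 4 actions, used only for this check.
p = np.array([0.1, 0.2, 0.3, 0.4], dtype=np.float32)

ce_with_itself = kls.categorical_crossentropy(p, p).numpy()  # cross-entropy of p with itself
shannon_entropy = -np.sum(p * np.log(p))                     # -sum(p * log p), in nats

print(ce_with_itself, shannon_entropy)  # both come out to about 1.28

If that is indeed the intended quantity, its maximum is log(action_size), reached by a uniform policy, which gives a reference level for reading the TensorBoard plot.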

The entropy plot looks like this:

[Plot: entropy over training episodes]

Based on DeepRLHacks:

a. If the entropy is not going down, the policy is bad because it is really random. -> It looks like the entropy in my case is good enough to pass this check.

b. If the entropy is going down too fast, the policy becomes deterministic and stops exploring. -> The entropy in my case goes to zero after 150 episodes. Is that too fast?

Also, is the calculation of entropy over the ACTION SPACE above correct? Any suggestions/guidance would be appreciated.

