
I'd like to ask: how is the output of a policy neural network organized for a continuous action space?

I know that in PPO the output is a mean and a standard deviation for each action. But how is this organized? For example, if the agent has 2 actions, do we get:

mean_0 - std_dev_0 - mean_1 - std_dev_1

or:

mean_0 - mean_1 - std_dev_0 - std_dev_1 

I searched the source code for the sampler function, but I couldn't find anything.


1 Answer


In Ray 1.10.0, I debugged a PPO example and found that the following code is relevant.

In torch_policy.py, you can find the following code:

if self.action_distribution_fn:
    # Try new action_distribution_fn signature, supporting
    # state_batches and seq_lens.
    try:
        dist_inputs, dist_class, state_out = \
            self.action_distribution_fn(
                self,
                self.model,
                input_dict=input_dict,
                state_batches=state_batches,
                seq_lens=seq_lens,
                explore=explore,
                timestep=timestep,
                is_training=False)
    # Trying the old way (to stay backward compatible).
    # TODO: Remove in future.
    except TypeError as e:
        if "positional argument" in e.args[0] or \
                "unexpected keyword argument" in e.args[0]:
            dist_inputs, dist_class, state_out = \
                self.action_distribution_fn(
                    self,
                    self.model,
                    input_dict[SampleBatch.CUR_OBS],
                    explore=explore,
                    timestep=timestep,
                    is_training=False)
        else:
            raise e
else:
    dist_class = self.dist_class
    dist_inputs, state_out = self.model(input_dict, state_batches,
                                        seq_lens)

if not (isinstance(dist_class, functools.partial)
        or issubclass(dist_class, TorchDistributionWrapper)):
    raise ValueError(
        "`dist_class` ({}) not a TorchDistributionWrapper "
        "subclass! Make sure your `action_distribution_fn` or "
        "`make_model_and_action_dist` return a correct "
        "distribution class.".format(dist_class.__name__))
action_dist = dist_class(dist_inputs, self.model)

Note that for PPO, dist_inputs contains both the means and the log standard deviations. For a distribution of the TorchDiagGaussian class, you can refer to torch_action_dist.py. In the script I debugged, the parameter inputs is the dist_inputs mentioned above.

class TorchDiagGaussian(TorchDistributionWrapper):
    """Wrapper class for PyTorch Normal distribution."""

    @override(ActionDistribution)
    def __init__(self, inputs: List[TensorType], model: TorchModelV2):
        super().__init__(inputs, model)
        mean, log_std = torch.chunk(self.inputs, 2, dim=1)
        self.dist = torch.distributions.normal.Normal(mean, torch.exp(log_std))
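To make the sizing concrete, here is a minimal standalone sketch (not RLlib code; the layer sizes and names are illustrative): a policy head with 2 * num_actions outputs produces a dist_inputs tensor, and the same torch.chunk split as in TorchDiagGaussian.__init__ recovers the means and log standard deviations.

import torch
import torch.nn as nn

# Illustrative sketch, not RLlib code: the policy head emits 2 * num_actions
# values per observation; these play the role of `inputs` above.
num_actions = 2
policy_head = nn.Linear(in_features=16, out_features=2 * num_actions)

obs_features = torch.randn(8, 16)        # batch of 8 feature vectors
dist_inputs = policy_head(obs_features)  # shape [8, 4]

# Same split as in TorchDiagGaussian.__init__:
# first half = means, second half = log standard deviations.
mean, log_std = torch.chunk(dist_inputs, 2, dim=1)
dist = torch.distributions.normal.Normal(mean, torch.exp(log_std))
actions = dist.sample()                  # shape [8, 2]: one value per action dimension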

According to this, I think the neural network output is organized in the second way you describe: all the means first, followed by all the (log) standard deviations.
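A toy check of that layout (the values are made up): with 2 actions and the ordering [mean_0, mean_1, log_std_0, log_std_1], torch.chunk recovers the means and log-stds as the two halves of the vector.

import torch

# Made-up values, ordered as [mean_0, mean_1, log_std_0, log_std_1].
dist_inputs = torch.tensor([[0.5, -0.5, -1.0, -2.0]])

mean, log_std = torch.chunk(dist_inputs, 2, dim=1)
print(mean)     # tensor([[ 0.5000, -0.5000]]) -> mean_0, mean_1
print(log_std)  # tensor([[-1.0000, -2.0000]]) -> log_std_0, log_std_1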

Answered 2022-02-28T09:21:29.640