I want to ask: how is the neural network output of a policy organized for a continuous action space?
I know that in PPO the output contains a mean and a standard deviation value for each action. But how is this organized? For example, if the agent has 2 actions, do we get:
mean_0 - std_dev_0 - mean_1 - std_dev_1
or:
mean_0 - mean_1 - std_dev_0 - std_dev_1
I searched the source code for the sampler function, but I could not find anything.
In Ray 1.10.0, I debugged a PPO example and found that the following code is relevant.
In torch_policy.py you can find this code:
if self.action_distribution_fn:
    # Try new action_distribution_fn signature, supporting
    # state_batches and seq_lens.
    try:
        dist_inputs, dist_class, state_out = \
            self.action_distribution_fn(
                self,
                self.model,
                input_dict=input_dict,
                state_batches=state_batches,
                seq_lens=seq_lens,
                explore=explore,
                timestep=timestep,
                is_training=False)
    # Trying the old way (to stay backward compatible).
    # TODO: Remove in future.
    except TypeError as e:
        if "positional argument" in e.args[0] or \
                "unexpected keyword argument" in e.args[0]:
            dist_inputs, dist_class, state_out = \
                self.action_distribution_fn(
                    self,
                    self.model,
                    input_dict[SampleBatch.CUR_OBS],
                    explore=explore,
                    timestep=timestep,
                    is_training=False)
        else:
            raise e
else:
    dist_class = self.dist_class
    dist_inputs, state_out = self.model(input_dict, state_batches,
                                        seq_lens)

if not (isinstance(dist_class, functools.partial)
        or issubclass(dist_class, TorchDistributionWrapper)):
    raise ValueError(
        "`dist_class` ({}) not a TorchDistributionWrapper "
        "subclass! Make sure your `action_distribution_fn` or "
        "`make_model_and_action_dist` return a correct "
        "distribution class.".format(dist_class.__name__))
action_dist = dist_class(dist_inputs, self.model)
Note that dist_inputs includes both the mean and the std for PPO. For a distribution of the TorchDiagGaussian class, you can then refer to torch_action_dist.py. In the script I debugged, the parameter 'inputs' is the dist_inputs mentioned above.
class TorchDiagGaussian(TorchDistributionWrapper):
    """Wrapper class for PyTorch Normal distribution."""

    @override(ActionDistribution)
    def __init__(self, inputs: List[TensorType], model: TorchModelV2):
        super().__init__(inputs, model)
        # First half of `inputs` are the means, second half the log-stds.
        mean, log_std = torch.chunk(self.inputs, 2, dim=1)
        self.dist = torch.distributions.normal.Normal(mean, torch.exp(log_std))
According to this, I think the neural network output is organized the second way you describe: all means first, then all standard deviations. One caveat: the second half of the output is actually the log standard deviation, which gets exponentiated before the Normal distribution is built.
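To make the ordering concrete, here is a minimal sketch that mirrors the torch.chunk call above (the tensor values are made up; only the layout matters):

import torch

# Hypothetical dist_inputs for a batch of 1 and an action space with 2 dims,
# laid out as [mean_0, mean_1, log_std_0, log_std_1].
dist_inputs = torch.tensor([[0.5, -1.0, 0.1, 0.2]])

# Same split as in TorchDiagGaussian.__init__:
mean, log_std = torch.chunk(dist_inputs, 2, dim=1)
print(mean)     # tensor([[ 0.5000, -1.0000]])  -> mean_0, mean_1
print(log_std)  # tensor([[0.1000, 0.2000]])    -> log_std_0, log_std_1

dist = torch.distributions.normal.Normal(mean, torch.exp(log_std))
action = dist.sample()
print(action.shape)  # torch.Size([1, 2]): one sample per action dimension

If the layout were the first one (mean and std interleaved per action), the chunk along dim=1 would mix means and stds, so the interleaved layout cannot be what the model produces here.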