python - Understanding multi-agent learning in OpenAI Gym and Stable Baselines

Question

As described in this article, I am trying to develop a multi-agent reinforcement learning model using OpenAI Stable Baselines and Gym.

I am confused about how the opponent agent is specified. It seems the opponent is passed to the environment as agent2, as shown below:

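from kaggle_environments import make

class ConnectFourGym:
    def __init__(self, agent2="random"):
        ks_env = make("connectx", debug=True)
        # None marks the seat of the agent being trained; agent2 is the opponent.
        self.env = ks_env.train([None, agent2])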

The ks_env.train() method appears to come from kaggle_environments.Environment:

def train(self, agents=[]):
    """
    Setup a lightweight training environment for a single agent.
    Note: This is designed to be a lightweight starting point which can
          be integrated with other frameworks (i.e. gym, stable-baselines).
          The reward returned by the "step" function here is a diff between the
          current and the previous step.
    Example:
        env = make("tictactoe")
        # Training agent in first position (player 1) against the default random agent.
        trainer = env.train([None, "random"])

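Q1. However, I am confused. Why does ConnectFourGym.__init__() call the train() method? In other words, why should the environment do the training? I would expect train() to be part of the model: the article above uses the PPO algorithm, which has a train() method. PPO.train() is called when we call PPO.learn(), which makes sense.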

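Q2. However, reading the code of PPO.learn(), I cannot see how it trains the current agent against multiple opponent agents. Shouldn't the model algorithm do that? Am I misreading it? Or is it that the model does not know the number of agents, that only the environment knows it, and that this is why the environment contains train()? In that case, why do we need an explicit Environment.train() method at all? The environment would return rewards based on the behavior of multiple agents, and the model would learn from them.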
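To make the reading in Q2 concrete, here is a minimal sketch of an environment that owns the opponent, so the learner only ever sees a single-agent reset()/step() interface. It is a toy illustration under stated assumptions, not the actual kaggle_environments or Stable Baselines code: TwoPlayerNim, SingleAgentTrainer, random_opponent, and run_episode are all hypothetical names, and the game (take 1 to 3 stones, whoever takes the last stone wins) stands in for ConnectX. Under this reading, a method like train() would only set up such a wrapper, which seems consistent with the docstring quoted above ("Setup a lightweight training environment for a single agent"), while the actual learning would still happen in the model.

```python
import random


class TwoPlayerNim:
    """Hypothetical two-player game: take 1-3 stones, last stone wins."""

    def __init__(self, stones=10):
        self.start = stones
        self.stones = stones

    def reset(self):
        self.stones = self.start
        return self.stones

    def move(self, take):
        """Apply one move; return True if it took the last stone."""
        take = max(1, min(3, take, self.stones))
        self.stones -= take
        return self.stones == 0


def random_opponent(stones):
    """Hypothetical opponent: picks a legal move uniformly at random."""
    return random.randint(1, min(3, stones))


class SingleAgentTrainer:
    """Gym-like wrapper: the opponent moves inside step(), so the
    learner faces what looks like a single-agent problem."""

    def __init__(self, game, opponent):
        self.game = game
        self.opponent = opponent

    def reset(self):
        return self.game.reset()

    def step(self, action):
        # The learner's move.
        if self.game.move(action):
            return self.game.stones, 1.0, True, {}
        # The opponent's reply, played by the environment itself.
        if self.game.move(self.opponent(self.game.stones)):
            return self.game.stones, -1.0, True, {}
        return self.game.stones, 0.0, False, {}


def run_episode(trainer, policy):
    """Play one episode; the policy never sees the opponent directly."""
    obs, done, total = trainer.reset(), False, 0.0
    while not done:
        obs, reward, done, _ = trainer.step(policy(obs))
        total += reward
    return total


if __name__ == "__main__":
    trainer = SingleAgentTrainer(TwoPlayerNim(10), random_opponent)
    print(run_episode(trainer, lambda stones: stones % 4 or 1))
```

In this sketch the policy (standing in for the model) receives only observations and rewards. The number of agents and the opponent's behavior are hidden inside step(), which is the situation I am asking about.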

Or have I completely misunderstood the basic concepts? Can someone help me?
