I am writing a program that teaches two players to play a simple board game using reinforcement learning, specifically afterstate-based temporal-difference learning, TD(λ). The learning is implemented by training a neural network (I use Sutton's nonlinear TD/Backprop neural network). I would really like to hear your opinion on the dilemma below. The basic algorithm/pseudocode for alternating turns between the two opponents is this:
WHITE_PLAYERS_ACTION = WHITE.CHOOSE_ACTION(GAME_STATE); // White decides on its next move by evaluating the current game state (TD(λ) learning)
GAME_STATE = WORLD.APPLY(WHITE_PLAYERS_ACTION); // Apply White's chosen action to the environment; a new game state emerges
IF (GAME_STATE != FINAL) { // If the new state is not final (not a winning state for White), do the same for Black
    BLACK_PLAYERS_ACTION = BLACK.CHOOSE_ACTION(GAME_STATE)
    GAME_STATE = WORLD.APPLY(BLACK_PLAYERS_ACTION) // Apply Black's chosen action to the environment; a new game state emerges
}
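For context, each call to PLAYER.LEARN performs one afterstate TD(λ) update. Below is a minimal sketch in Python, assuming a linear value function in place of the nonlinear TD/Backprop network (the nonlinear case simply replaces the feature vector in the trace update with the gradient of the network output with respect to its weights). The class and argument names are illustrative, not my actual code:

import numpy as np

class TDLambdaLearner:
    # Afterstate TD(lambda) with a *linear* value function, used here only to
    # keep the update rule readable; the update has the same shape when the
    # value function is a neural network trained with TD/Backprop.
    # `x` is whatever feature vector is extracted from an afterstate.

    def __init__(self, n_features, alpha=0.1, gamma=1.0, lam=0.7):
        self.w = np.zeros(n_features)   # value-function weights
        self.e = np.zeros(n_features)   # eligibility traces
        self.alpha, self.gamma, self.lam = alpha, gamma, lam
        self.prev_x = None              # features of the previous afterstate seen by this player

    def value(self, x):
        return float(self.w @ x)

    def start_episode(self):
        self.e[:] = 0.0
        self.prev_x = None

    def learn(self, x, reward=0.0, terminal=False):
        # One TD(lambda) update, from this player's previous afterstate to the new afterstate `x`.
        if self.prev_x is not None:
            target = reward + (0.0 if terminal else self.gamma * self.value(x))
            delta = target - self.value(self.prev_x)                # TD error
            self.e = self.gamma * self.lam * self.e + self.prev_x   # accumulate traces (for a linear V, grad V = x)
            self.w += self.alpha * delta * self.e
        self.prev_x = None if terminal else x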
When should each player call its learning method, PLAYER.LEARN(GAME_STATE)? This is the dilemma.
Option A. After each player's own move, as soon as the new afterstate has emerged, like this:
WHITE_PLAYERS_ACTION = WHITE.CHOOSE_ACTION(GAME_STATE);
GAME_STATE = WORLD.APPLY(WHITE_PLAYERS_ACTION);
WHITE.LEARN(GAME_STATE) // White learns from the afterstate that emerged right after its own action
IF (GAME_STATE != FINAL) {
    BLACK_PLAYERS_ACTION = BLACK.CHOOSE_ACTION(GAME_STATE)
    GAME_STATE = WORLD.APPLY(BLACK_PLAYERS_ACTION)
    BLACK.LEARN(GAME_STATE) // Black learns from the afterstate that emerged right after its own action
}
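Written out in Python-like form, Option A amounts to the loop below. I am assuming, for illustration only, a world object with reset/apply/is_final/reward_for methods, a features helper, players with a choose_action method (greedy one-ply search over afterstate values, not shown in the learner sketch above), and a win/loss reward given only at the terminal state:

WHITE, BLACK = 0, 1

def play_episode_option_a(world, white, black, features):
    # Option A: each player learns only from the afterstate produced by its own move.
    state = world.reset()
    white.start_episode()
    black.start_episode()
    while True:
        state = world.apply(white.choose_action(state))           # White moves
        white.learn(features(state), reward=world.reward_for(WHITE, state),
                    terminal=world.is_final(state))
        if world.is_final(state):
            return state                                          # Black gets no update from this terminal afterstate
        state = world.apply(black.choose_action(state))           # Black moves
        black.learn(features(state), reward=world.reward_for(BLACK, state),
                    terminal=world.is_final(state))
        if world.is_final(state):
            return state                                          # White gets no update from this terminal afterstate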
Option B. After each player's own move, as soon as the new afterstate has emerged, and additionally after the opponent's move whenever that move wins the game, like this:
WHITE_PLAYERS_ACTION = WHITE.CHOOSE_ACTION(GAME_STATE);
GAME_STATE = WORLD.APPLY(WHITE_PLAYERS_ACTION);
WHITE.LEARN(GAME_STATE)
IF (GAME_STATE == FINAL) // If White won
    BLACK.LEARN(GAME_STATE) // Black also learns from White's winning afterstate
IF (GAME_STATE != FINAL) { // If White's move did not produce a winning/final afterstate
    BLACK_PLAYERS_ACTION = BLACK.CHOOSE_ACTION(GAME_STATE)
    GAME_STATE = WORLD.APPLY(BLACK_PLAYERS_ACTION)
    BLACK.LEARN(GAME_STATE)
    IF (GAME_STATE == FINAL) // If Black won
        WHITE.LEARN(GAME_STATE) // White also learns from Black's winning afterstate
}
I think Option B makes more sense.
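Under the same illustrative assumptions as the Option A sketch above, Option B changes only the terminal handling of that loop: whenever a move ends the game, the player who did not move also gets one final update from the terminal afterstate:

def play_episode_option_b(world, white, black, features):
    # Option B: same as Option A, except that when a move ends the game the
    # opponent also learns from that terminal afterstate.
    state = world.reset()
    white.start_episode()
    black.start_episode()
    while True:
        state = world.apply(white.choose_action(state))           # White moves
        white.learn(features(state), reward=world.reward_for(WHITE, state),
                    terminal=world.is_final(state))
        if world.is_final(state):                                 # White won (or the game otherwise ended)
            black.learn(features(state), reward=world.reward_for(BLACK, state),
                        terminal=True)                            # Black learns from White's winning afterstate
            return state
        state = world.apply(black.choose_action(state))           # Black moves
        black.learn(features(state), reward=world.reward_for(BLACK, state),
                    terminal=world.is_final(state))
        if world.is_final(state):                                 # Black won
            white.learn(features(state), reward=world.reward_for(WHITE, state),
                        terminal=True)                            # White learns from Black's winning afterstate
            return state

The reason I lean towards B is that under Option A the losing player's last update happens before the opponent's winning move, so the win/loss outcome never enters its TD error within that episode; Option B feeds exactly that signal back through the extra terminal call to LEARN.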