
I am trying to get into machine learning and decided to try things out myself. I wrote a small tic-tac-toe game. So far, the computer plays against itself using random moves.

Now I want to apply reinforcement learning by writing an agent that will explore or exploit based on what it knows about the current state of the board.

The part I don't understand is this: what does the agent use to train itself for a given state? Let's say an RNG bot playing (o) does this:

[..][..][..]

[..][x][o]

[..][..][..]

Now the agent has to decide what the best move should be. A well-trained player would pick square 1, 3, 7, or 9. Does it look up a similar state in its database that led to a win? Because if so, I think I would need to save every single move into that database up to the terminal state (win/lose/draw), and that would be quite a lot of data for a single game, wouldn't it?

If I'm thinking about this the wrong way, I would like to know how to do it right.


1 Answer


Learning

1) Observe a current board state s;

2) Make the next move based on the values V(s') of all available next states s'. Strictly, the choice is often drawn from a Boltzmann distribution over V(s'), but it can be simplified to the maximum-value move (greedy) or, with some probability epsilon, a random move, as you are using;

3) Record s' in a sequence;

4) If the game finishes, update the values of the visited states in the recorded sequence and start over again; otherwise, go to 1). (Steps 2 and 4 are sketched in code after this list.)
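
A minimal sketch of steps 2 and 4 in Python, assuming the board is encoded as a 9-character string and the value table is a plain dictionary (the names V, value, choose_move, update_sequence, EPSILON and ALPHA are my own, not taken from the original answer):

```python
import random

EPSILON = 0.1   # probability of a random exploratory move
ALPHA = 0.3     # learning rate for the value update

V = {}          # V[s] -> learned value of board state s

def value(s):
    # Unseen states start with a neutral estimate of 0.5.
    return V.get(s, 0.5)

def choose_move(next_states):
    # Step 2: epsilon-greedy choice over the values of all reachable next states s'.
    if random.random() < EPSILON:
        return random.choice(next_states)
    return max(next_states, key=value)

def update_sequence(sequence, final_reward):
    # Step 4: at the end of a game, set the last visited state to the final
    # reward (e.g. 1 for a win, 0 for a loss, 0.5 for a draw) and back it up
    # through the sequence, nudging each V(s) toward the value of the state
    # that followed it.
    V[sequence[-1]] = final_reward
    for s, s_next in zip(reversed(sequence[:-1]), reversed(sequence[1:])):
        V[s] = value(s) + ALPHA * (value(s_next) - value(s))
```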

Game Playing

1) Observe a current board state s;

2) Make the next move based on the values V(s') of all available next states;

3) If the game is over, start over again; otherwise, go to 1).
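
Game playing then amounts to the same lookup with exploration switched off. A minimal sketch, assuming the same 9-character board string and value dictionary V as above:

```python
def play_move(board, mark, V):
    # Enumerate every board reachable by placing `mark` on an empty cell,
    # then take the one with the highest learned value (purely greedy).
    options = [board[:i] + mark + board[i + 1:]
               for i, c in enumerate(board) if c == ' ']
    return max(options, key=lambda s: V.get(s, 0.5))
```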

Regarding your question: yes, the look-up table used in the Game Playing phase is built up in the Learning phase. Each time, the state is chosen from all the V(s), of which there are at most 3^9 = 19683. Here is sample code, written in Python, that runs 10000 games in training.
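
(The answer's original code listing is not preserved here; the following is a minimal, self-contained sketch of such a trainer under the same assumptions as above, with epsilon-greedy self-play and all names being my own.)

```python
import random

EPSILON, ALPHA = 0.1, 0.3
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    # Return 'x' or 'o' if that player has three in a row, else None.
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def moves(board, mark):
    # All boards reachable by placing `mark` on an empty cell.
    return [board[:i] + mark + board[i + 1:]
            for i, c in enumerate(board) if c == ' ']

def train(n_games=10000):
    V = {}                                   # at most 3**9 = 19683 entries
    value = lambda s: V.get(s, 0.5)
    for _ in range(n_games):
        board, mark = ' ' * 9, 'x'
        visited = {'x': [], 'o': []}
        while True:
            options = moves(board, mark)
            if not options:                  # board full: draw
                reward = {'x': 0.5, 'o': 0.5}
                break
            # Epsilon-greedy move choice, as in the Learning steps above.
            if random.random() < EPSILON:
                board = random.choice(options)
            else:
                board = max(options, key=value)
            visited[mark].append(board)
            if winner(board):
                reward = {mark: 1.0, ('o' if mark == 'x' else 'x'): 0.0}
                break
            mark = 'o' if mark == 'x' else 'x'
        # Back up each player's final reward through the states it visited.
        for player, seq in visited.items():
            if not seq:
                continue
            V[seq[-1]] = reward[player]
            for s, s_next in zip(reversed(seq[:-1]), reversed(seq[1:])):
                V[s] = value(s) + ALPHA * (value(s_next) - value(s))
    return V

if __name__ == '__main__':
    table = train()
    print(len(table), 'states seen during training')
```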

answered 2014-02-17T00:13:34.470