python - Implementing the TD-Gammon algorithm

Question

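I am trying to implement the algorithm from Gerald Tesauro's TD-Gammon article. The core of the learning algorithm is described in the following paragraph: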
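For reference, the TD(λ) weight update at the heart of that article is commonly written as below. The notation is an assumption and may differ slightly from the quoted passage: α is the learning rate (`a` in the code), λ is the decay parameter (`l` in the code), Y_t is the network output for the state at step t, and ∇_w Y_k is the gradient of the output at step k with respect to the weights.

```latex
w_{t+1} - w_t = \alpha \, (Y_{t+1} - Y_t) \sum_{k=1}^{t} \lambda^{t-k} \, \nabla_w Y_k
```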

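I decided to use a single hidden layer (if that was enough to play world-class backgammon in the early 1990s, it is enough for me). I am fairly sure that everything except the train() function is correct (the other parts were easier to test), but I have no idea whether I have implemented this final algorithm correctly.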

import numpy as np

class TD_network:
    """
    Neural network with a single hidden layer and a Temporal Difference training algorithm
    taken from G. Tesauro's 1995 TD-Gammon article.
    """
    def __init__(self, num_input, num_hidden, num_output, hnorm, dhnorm, onorm, donorm):
        self.w21 = 2*np.random.rand(num_hidden, num_input) - 1
        self.w32 = 2*np.random.rand(num_output, num_hidden) - 1
        self.b2 = 2*np.random.rand(num_hidden) - 1
        self.b3 = 2*np.random.rand(num_output) - 1
        self.hnorm = hnorm
        self.dhnorm = dhnorm
        self.onorm = onorm
        self.donorm = donorm

    def value(self, input):
        """Evaluates the NN output"""
        assert(input.shape == self.w21[1,:].shape)
        h = self.w21.dot(input) + self.b2
        hn = self.hnorm(h)
        o = self.w32.dot(hn) + self.b3
        return(self.onorm(o))

    def gradient(self, input):
        """
        Calculates the gradient of the NN at the given input. Outputs a list of dictionaries
        where each dict corresponds to the gradient of an output node, and each element in
        a given dict gives the gradient for a subset of the weights. 
        """ 
        assert(input.shape == self.w21[1,:].shape)
        J = []
        h = self.w21.dot(input) + self.b2
        hn = self.hnorm(h)
        o = self.w32.dot(hn) + self.b3

        for i in range(len(self.b3)):
            db3 = np.zeros(self.b3.shape)
            db3[i] = self.donorm(o[i])

            dw32 = np.zeros(self.w32.shape)
            dw32[i, :] = self.donorm(o[i])*hn

            db2 = np.multiply(self.dhnorm(h), self.w32[i,:])*self.donorm(o[i])
            dw21 = np.transpose(np.outer(input, db2))

            J.append(dict(db3 = db3, dw32 = dw32, db2 = db2, dw21 = dw21))
        return(J)

    def train(self, input_states, end_result, a = 0.1, l = 0.7):
        """
        Trains the network using a single series of input states representing a game from beginning
        to end, and a final (supervised / desired) output for the end state
        """
        outputs = [self.value(input_state) for input_state in input_states]
        outputs.append(end_result)
        for t in range(len(input_states)):
            delta = dict(
                db3 = np.zeros(self.b3.shape),
                dw32 = np.zeros(self.w32.shape),
                db2 = np.zeros(self.b2.shape),
                dw21 = np.zeros(self.w21.shape))
            grad = self.gradient(input_states[t])
            for i in range(len(self.b3)):
                for key in delta.keys():
                    td_sum = sum([l**(t-k)*grad[i][key] for k in range(t + 1)])
                    delta[key] += a*(outputs[t + 1][i] - outputs[t][i])*td_sum
            self.w21 += delta["dw21"]
            self.w32 += delta["dw32"]
            self.b2 += delta["db2"]
            self.b3 += delta["db3"]
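The claim that gradient() is easier to test than train() can be made concrete with a finite-difference check. The sketch below is not the class above: it is a standalone illustration that assumes sigmoid units for both hnorm and onorm, assumes dhnorm and donorm take the pre-activation values (as the code above appears to do), and checks only the dw21 part. All function names here are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    # Derivative of the sigmoid, evaluated at the pre-activation x
    s = sigmoid(x)
    return s * (1.0 - s)

def value(w21, b2, w32, b3, x):
    """Forward pass of a single-hidden-layer network with sigmoid units."""
    h = w21.dot(x) + b2
    return sigmoid(w32.dot(sigmoid(h)) + b3)

def analytic_dw21(w21, b2, w32, b3, x, i):
    """Gradient of output i with respect to w21, using the question's formulas."""
    h = w21.dot(x) + b2
    o = w32.dot(sigmoid(h)) + b3
    db2 = dsigmoid(h) * w32[i, :] * dsigmoid(o[i])
    return np.outer(db2, x)  # same as np.transpose(np.outer(x, db2))

def numeric_dw21(w21, b2, w32, b3, x, i, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = np.zeros_like(w21)
    for idx in np.ndindex(*w21.shape):
        step = np.zeros_like(w21)
        step[idx] = eps
        plus = value(w21 + step, b2, w32, b3, x)[i]
        minus = value(w21 - step, b2, w32, b3, x)[i]
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

# Small random network: 3 inputs, 4 hidden nodes, 2 outputs
rng = np.random.default_rng(0)
w21, b2 = rng.uniform(-1, 1, (4, 3)), rng.uniform(-1, 1, 4)
w32, b3 = rng.uniform(-1, 1, (2, 4)), rng.uniform(-1, 1, 2)
x = rng.uniform(-1, 1, 3)

# Largest absolute disagreement between the two gradients over both outputs
max_error = max(
    np.abs(analytic_dw21(w21, b2, w32, b3, x, i)
           - numeric_dw21(w21, b2, w32, b3, x, i)).max()
    for i in range(2)
)
print(max_error)
```

If the analytic and numeric gradients agree to within a small tolerance, the gradient code is probably not the source of the problem, which narrows the search to train().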

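The way I use it is to play through an entire game (or rather, the neural network plays against itself), and then I send the states of that game, from start to finish, to train(), along with the final result. It then takes this game log and applies the formula above using the first game state, then the first and second game states, and so on until the final step, where it uses the entire list of game states. I then repeat that many times and hope that the network learns.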

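To be clear, I am not after feedback on my coding style. This was never meant to be more than a quick and dirty implementation to check that I have all the nuts and bolts in the right places.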

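However, I have no idea whether it is correct, since so far I have been unable to make it play tic-tac-toe at any reasonable level. There could be many reasons for that. Maybe I am not giving it enough hidden nodes (I have used 10 to 12). Maybe it needs more games to train on (I have used 200 000). Maybe it would do better with different normalisation functions (I have tried sigmoid and ReLU, leaky and non-leaky, in different variations). Maybe the learning parameters are not tuned correctly. Maybe tic-tac-toe and its deterministic gameplay mean that it "locks in" on certain paths in the game tree. Or maybe the training implementation is simply wrong. Which is why I am here.

Have I misunderstood Tesauro's algorithm?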

score 3 · Accepted Answer

I can't say that I entirely understand your implementation, but this line jumps out at me:

                    td_sum = sum([l**(t-k)*grad[i][key] for k in range(t + 1)])

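Compare it with the formula you quoted:

I see at least two differences:

Your implementation sums over t+1 elements, compared with t elements in the formula.
The gradient should be indexed with the same k that is used in l**(t-k), but in your implementation it is indexed with i and key, without any reference to k.

Perhaps if you resolve these differences, your solution will behave more as expected.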
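The second point can be sketched as follows. This is a minimal illustration rather than a drop-in replacement for train(): it assumes a single scalar output, a single weight array, and gradients that have already been computed for every state of one game. The function names are hypothetical. It shows that a running trace, updated as trace = l * trace + grads[t], reproduces the explicit sum in which each gradient is taken at its own state k.

```python
import numpy as np

def td_lambda_updates(grads, outputs, a=0.1, l=0.7):
    """
    Weight changes for one game, one per time step, using a running trace.

    grads[k] : gradient of the output w.r.t. the weights at state k (ndarray)
    outputs  : outputs[0..T-1] are network outputs, outputs[T] is the end result
    """
    trace = np.zeros_like(grads[0])
    updates = []
    for t in range(len(grads)):
        # After this line, trace == sum over k = 0..t of l**(t-k) * grads[k]
        trace = l * trace + grads[t]
        updates.append(a * (outputs[t + 1] - outputs[t]) * trace)
    return updates

def td_lambda_updates_explicit(grads, outputs, a=0.1, l=0.7):
    """The same quantity written as an explicit sum, for comparison."""
    updates = []
    for t in range(len(grads)):
        # Note grads[k], not grads[t]: each term uses the gradient at state k
        td_sum = sum(l ** (t - k) * grads[k] for k in range(t + 1))
        updates.append(a * (outputs[t + 1] - outputs[t]) * td_sum)
    return updates
```

In the question's train(), the analogous change would be to keep one running trace per output node and per key in the gradient dictionary, instead of reusing the gradient of the current state for every k. Because grad[i][key] there does not depend on k, the original td_sum reduces to the current gradient multiplied by a geometric series in l.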
