python - OverflowError when using the value iteration algorithm with mdptoolbox

Question

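I have set up a simple MDP for a board with 4 possible states and 4 possible actions. The board and the rewards are set up as shown here:

Here S4 is the goal state, and S2 is also an absorbing state. In the code below I define the transition probability matrices and the reward matrix to obtain the optimal value function for this MDP. But when I run the code, I get the error message: OverflowError: cannot convert float infinity to integer. I don't understand why.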

import mdptoolbox
import numpy as np

transitions = np.array([
    # action 1 (Right)
    [
        [0.1, 0.7, 0.1, 0.1],
        [0.3, 0.3, 0.3, 0.1],
        [0.1, 0.2, 0.2, 0.5],
        [0.1,  0.1,  0.1,  0.7]
    ],
    # action 2 (Down)
    [
        [0.1, 0.4, 0.4, 0.1],
        [0.3, 0.3, 0.3, 0.1],
        [0.4, 0.1, 0.4, 0.1],
        [0.1,  0.1,  0.1,  0.7]
    ],
    # action 3 (Left)
    [
        [0.4, 0.3, 0.2, 0.1],
        [0.2, 0.2, 0.4, 0.2],
        [0.5, 0.1, 0.3, 0.1],
        [0.1,  0.1,  0.1,  0.7]
    ],
    # action 4 (Top)
    [
        [0.1, 0.4, 0.4, 0.1],
        [0.3, 0.3, 0.3, 0.1],
        [0.4, 0.1, 0.4, 0.1],
        [0.1,  0.1,  0.1,  0.7]
    ]
])

rewards = np.array([
    [-1, -100, -1, 1],
    [-1, -100, -1, 1],
    [-1, -100, -1, 1],
    [1, 1, 1, 1]
])


vi = mdptoolbox.mdp.ValueIteration(transitions, rewards, discount=0.5)
vi.setVerbose()
vi.run()

print("Value function:")
print(vi.V)


print("Policy function")
print(vi.policy)

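If I change the value of discount from 0.5 to 1, it works fine. What could be the reason that value iteration fails with a discount of 0.5 or any other decimal value?

Update: there seems to be a problem with my reward matrix. I was not able to write it the way I intended, because if I change some of the values in the reward matrix, the error goes away.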

score 0 · Accepted Answer

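So it turns out that the reward matrix I defined was incorrect. Given the rewards defined in the figure above, it should be of the (S, A) form described in the documentation, where each row corresponds to a state, from S1 to S4, and each column corresponds to an action, from A1 to A4. The new reward matrix looks like this: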

#(S,A)
rewards = np.array([
    [-1, -1, -1, -1],
    [-100, -100, -100, -100],
    [-1, -1, -1, -1],
    [1, 1, 1, 1]
])

It works with this matrix. But I am still not sure what happens internally to cause the overflow error.
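
As for the overflow itself, here is a rough sketch of what I believe happens. It assumes that, when discount < 1, ValueIteration computes an upper bound on the number of iterations by dividing by the span (max minus min) of the first Bellman backup starting from V = 0. That is my reading of the pymdptoolbox source rather than anything confirmed above, and the helper names, epsilon = 0.01 and k = 0.6 are assumptions made for illustration. With the original matrix read as (S, A), the best reward in every row is 1, so the span is 0, the bound becomes infinite, and converting it to an integer fails. With discount = 1 that bound is apparently not computed, which would explain why the error went away.

```python
import math
import numpy as np

def first_backup_span(rewards_sa):
    # Starting from V = 0 the discounted term vanishes, so the first
    # backup is just the best immediate reward in each state (row).
    v1 = rewards_sa.max(axis=1)
    return np.float64(v1.max() - v1.min())

def iteration_bound(span, discount=0.5, k=0.6, epsilon=0.01):
    # Assumed form of the bound. k = 1 - (sum over columns of the smallest
    # transition probability); every column of the posted matrices has
    # minimum 0.1, so k = 1 - 4 * 0.1 = 0.6.
    with np.errstate(divide="ignore"):
        ratio = np.float64(epsilon * (1 - discount) / discount) / span
    return math.log(ratio) / math.log(discount * k)

# Original matrix from the question, read row by row as (S, A).
original = np.array([[-1, -100, -1, 1]] * 3 + [[1, 1, 1, 1]])
# Corrected (S, A) matrix from the answer.
corrected = np.array([[-1] * 4, [-100] * 4, [-1] * 4, [1] * 4])

print("span, original: ", first_backup_span(original))
print("span, corrected:", first_backup_span(corrected))
print("bound, corrected:", math.ceil(iteration_bound(first_backup_span(corrected))))

try:
    math.ceil(iteration_bound(first_backup_span(original)))
except OverflowError as err:
    print("OverflowError:", err)
```

If this reading is right, any reward matrix whose row-wise maxima are all equal would trigger the same error for discount < 1, which matches the observation that changing a few reward values makes it disappear.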
