I am learning Q-learning and found a Wikipedia article and this website.
Based on the tutorial and the pseudocode, I have written this much in R:
#q-learning example
#http://mnemstudio.org/path-finding-q-learning-tutorial.htm
#https://en.wikipedia.org/wiki/Q-learning
set.seed(2016)
iter=100
dimension=5;
alpha=0.1 #learning rate
gamma=0.8 #discount factor
# Q matrix (dimension x dimension), initialized to 0
Q=matrix( rep( 0, length.out=dimension*dimension), nrow = dimension)
Q
# R: -1 is a fire pit, 0 is a safe path, 100 is the goal state ########
R=matrix( sample( -1:0, dimension*dimension,replace=T,prob=c(1,2)), nrow = dimension)
R[dimension,dimension]=100
R #reward matrix
################
for(i in 1:iter){
  row=sample(1:dimension,1)
  col=sample(1:dimension,1)
  I=Q[row,col] #randomly choosing initial state
  Q[row,col]=Q[row,col]+alpha*(R[row,col]+gamma*max(Qdash-Q[row,col]))
  #update equation from Wikipedia; Qdash is the part I am unsure about (see below)
}
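For reference, the update rule I was trying to copy from the Wikipedia article is (writing it out in my own notation):

Q(s, a) <- Q(s, a) + alpha * ( R + gamma * max over a' of Q(s', a') - Q(s, a) )

where s' is the next state and the max runs over all actions a' available in s'.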
But I have a problem with max(Qdash-Q[row,col]): according to the website it should be Max[Q(next state, all actions)].
How do I search through all the actions of the next state programmatically?
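In case it clarifies what I mean: in my grid setup, would "all actions" of a state simply be moves to its in-bounds neighbour cells? Here is a sketch of what I imagine (only my guess; dimension and Q come from the code above, and s_row/s_col are example indices standing in for the next state):

# my guess: the actions available in a state are moves to its in-bounds neighbour cells
s_row=1; s_col=2                                 # example indices standing in for the next state
moves=list(c(-1,0), c(1,0), c(0,-1), c(0,1))     # up, down, left, right
reachable=lapply(moves, function(m) c(s_row+m[1], s_col+m[2]))
reachable=Filter(function(s) all(s>=1 & s<=dimension), reachable)
# is Max[Q(next state, all actions)] then just the max of Q over these cells?
max_next_Q=max(sapply(reachable, function(s) Q[s[1],s[2]]))

Is that the right idea?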
My second question is about this pseudocode:
Do While the goal state hasn't been reached.
Select one among all possible actions for the current state.
Using this possible action, consider going to the next state.
Get maximum Q value for this next state based on all possible actions.
Compute: Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
Set the next state as the current state.
End Do
Is it this?
while(Q<100){
  Q[row,col]=Q[row,col]+alpha*(R[row,col]+gamma*max(Qdash-Q[row,col]))
}