The problem I actually want to solve is not this simple; this is a toy game that should help me work up to the bigger one.

I have a 5x5 matrix whose values are all 0:
structure = np.zeros(25).reshape(5, 5)
The goal is for the agent to turn every value into 1, so I have:
goal_structure = np.ones(25).reshape(5, 5)
I created a Player class with 5 actions: move left, right, up, or down, or flip (turn a 0 into a 1, or a 1 into a 0). For the reward: if the agent changes a 0 to a 1, it gets +1; if it changes a 1 to a 0, it gets a negative reward (I have tried many values, from -1 down to 0 and even -0.1); if it just moves left, right, up, or down, it gets a reward of 0.
Because I want to feed the state to my neural network, I reshape it as follows:
reshaped_structure = np.reshape(structure, (1, 25))
Then I append the agent's normalized position to the end of this array (because I think the agent should know where it is):
reshaped_state = np.append(reshaped_structure, (np.float64(self.x/4), np.float64(self.y/4)))
state = reshaped_state
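Note that `np.append` flattens its inputs, so the state my network actually sees is a flat vector of length 27, not a (1, 27) row:

```python
import numpy as np

structure = np.zeros(25).reshape(5, 5)
reshaped_structure = np.reshape(structure, (1, 25))
# np.append with no axis argument flattens both arrays before concatenating,
# so the (1, 25) grid plus the two position values becomes a (27,) vector.
reshaped_state = np.append(reshaped_structure, (0.0 / 4, 0.0 / 4))
print(reshaped_state.shape)  # (27,)
```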
But I am not getting any good results! The agent behaves as if it were random! I have tried different reward functions and different DQN improvements such as experience replay, target networks, Double DQN, and dueling networks, but none of them seem to work! I suspect the problem is how I define the state. Can anyone help me define a good state?
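One alternative I have been considering (I am not sure it is the right approach) is a 2-channel encoding: the grid itself plus a one-hot plane marking the agent's position, instead of appending two scalar coordinates. A minimal sketch, with `encode_state` as a hypothetical helper name:

```python
import numpy as np

def encode_state(structure, x, y):
    """Stack the 5x5 grid with a one-hot plane marking the agent's cell."""
    position_plane = np.zeros_like(structure)
    position_plane[x, y] = 1.0
    # Shape (2, 5, 5): channel 0 = grid values, channel 1 = agent position.
    return np.stack([structure, position_plane])

state = encode_state(np.zeros((5, 5)), 2, 3)
print(state.shape)  # (2, 5, 5)
```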
Thanks a lot!

PS: here is my step function (with the rest of my Player class):
import numpy as np
from gym import spaces

# Action codes and grid bounds (defined once at module level).
left, right, up, down, flip = 0, 1, 2, 3, 4
x_min = y_min = 0
x_threshold = y_threshold = 4

structure = np.zeros(25).reshape(5, 5)


class Player:
    def __init__(self):
        self.x = 0
        self.y = 0
        self.max_time_step = 50
        self.time_step = 0
        self.reward_list = []
        self.sum_reward_list = []
        self.sum_rewards = []
        self.gather_positions = []
        self.action_space = spaces.Discrete(5)
        self.observation_space = 27

    def get_done(self, time_step):
        # The episode ends after a fixed number of steps.
        return time_step == self.max_time_step

    def flip_pixel(self):
        # Toggle the cell under the agent between 0 and 1.
        structure[self.x][self.y] = 1 - structure[self.x][self.y]

    def step(self, action, time_step):
        reward = 0
        # Movement actions, clamped to the grid edges.
        if action == right:
            self.y = min(self.y + 1, y_threshold)
        elif action == left:
            self.y = max(self.y - 1, y_min)
        elif action == up:
            self.x = max(self.x - 1, x_min)
        elif action == down:
            self.x = min(self.x + 1, x_threshold)
        elif action == flip:
            self.flip_pixel()
            # +1 for turning a 0 into a 1, a small penalty for the reverse.
            if structure[self.x][self.y] == 1:
                reward = 1
            else:
                reward = -0.1
        self.reward_list.append(reward)
        done = self.get_done(time_step)
        # Flatten the grid and append the normalized agent position.
        state = np.append(structure.reshape(25), (self.x / 4, self.y / 4))
        return state, reward, done

    def reset(self):
        global structure  # without this, reset only rebinds a local variable
        structure = np.zeros(25).reshape(5, 5)
        self.x = 0
        self.y = 0
        self.reward_list = []
        self.gather_positions = []
        state = np.append(structure.reshape(25), (0, 0))
        return state
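For reference, a random agent should average close to zero reward here, which is the baseline my DQN is failing to beat. A standalone sanity-check rollout, re-implementing the movement/flip logic inline with the same clamping and rewards as my step function (action codes 0-4 are my own convention):

```python
import numpy as np

rng = np.random.default_rng(0)
structure = np.zeros((5, 5))
x = y = 0
total_reward = 0.0

for t in range(50):
    action = rng.integers(5)  # 0=left, 1=right, 2=up, 3=down, 4=flip
    if action == 0:
        y = max(y - 1, 0)
    elif action == 1:
        y = min(y + 1, 4)
    elif action == 2:
        x = max(x - 1, 0)
    elif action == 3:
        x = min(x + 1, 4)
    else:
        structure[x, y] = 1 - structure[x, y]
        # +1 for a 0 -> 1 flip, -0.1 for a 1 -> 0 flip.
        total_reward += 1 if structure[x, y] == 1 else -0.1

print(total_reward, structure.sum())
```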