I am trying to follow a tutorial by a popular YouTuber on custom OpenAI Gym environments, but I cannot reproduce his results.
I initially set up my model as
model = PPO("MlpPolicy", env, verbose=1, tensorboard_log=log_path)
and trained it for 500K steps
model.learn(total_timesteps=500000)
But it does not seem to improve at all: the mean reward stays at 0, with a standard deviation between 58 and 60. I checked this with
episode_result = evaluate_policy(model, env, n_eval_episodes=10)
print("reward: {} std: {} ".format(episode_result[0], episode_result[1]))
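For context on those numbers, here is a minimal sketch in plain Python (not part of the tutorial) of what the evaluation statistics would look like if the policy never changed the temperature. It assumes the start temperature is drawn uniformly from 35 to 41 and that an episode lasts 60 steps, as in the environment shown in this post; `START_TEMPS`, `EPISODE_LENGTH` and `episode_return_if_staying` are names made up for illustration. Whether the trained model actually behaves this way has not been verified.

```python
import statistics

# Assumption: start temperatures are 38 + random.randint(-3, 3), i.e. 35..41 inclusive.
START_TEMPS = range(35, 42)
EPISODE_LENGTH = 60  # assumption: matches shower_length in the environment


def episode_return_if_staying(start_temp, steps=EPISODE_LENGTH):
    """Total reward of one episode when the temperature never changes."""
    per_step = 1 if 37 <= start_temp <= 39 else -1
    return per_step * steps


returns = [episode_return_if_staying(t) for t in START_TEMPS]
mean_return = statistics.mean(returns)
std_return = statistics.pstdev(returns)  # population standard deviation

print("returns:", returns)
print("mean: {:.2f} std: {:.2f}".format(mean_return, std_return))
```

Every episode in this sketch ends at either +60 or -60, and an even split of the two over the evaluated episodes gives a mean of 0 with a standard deviation of exactly 60. A result of mean 0 and std near 60 is therefore at least compatible with a policy that barely moves the temperature.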
The custom environment is
class ShowerEnv(Env):
    def __init__(self):
        # Actions we can take, down, stay, up
        self.action_space = Discrete(3)
        # Temperature array
        self.observation_space = Box(low=np.array([0]), high=np.array([100]))
        # Set start temp
        self.state = 38 + random.randint(-3, 3)
        # Set shower length
        self.shower_length = 60

    def step(self, action):
        # Apply action
        self.state += action - 1
        # Reduce shower length by 1 second
        self.shower_length -= 1
        # Calculate reward
        if self.state >= 37 and self.state <= 39:
            reward = 1
        else:
            reward = -1
        # Check if shower is done
        if self.shower_length <= 0:
            done = True
        else:
            done = False
        # Set placeholder for info
        info = {}
        # Return step information
        return self.state, reward, done, info

    def render(self, mode):
        pass

    def reset(self):
        # Reset shower temperature
        self.state = np.array([38 + random.randint(-3, 3)]).astype(float)
        # Reset shower time
        self.shower_length = 60
        return self.state
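To check that a high reward is achievable at all, here is a small standalone sketch that re-implements the same step rule in plain Python (no Gym or Stable-Baselines3) and runs a hand-written controller: turn the temperature up below 38, down above 38, otherwise stay. `thermostat_action` and `run_episode` are hypothetical helper names, not part of the tutorial, and the sketch assumes integer temperatures.

```python
def thermostat_action(temp):
    """Hand-written policy: 0 = down, 1 = stay, 2 = up (same encoding as Discrete(3))."""
    if temp < 38:
        return 2
    if temp > 38:
        return 0
    return 1


def run_episode(start_temp, policy, length=60):
    """Replay the environment's step rule and return the total reward."""
    temp = start_temp
    total = 0
    for _ in range(length):
        temp += policy(temp) - 1  # same update as `self.state += action - 1`
        total += 1 if 37 <= temp <= 39 else -1
    return total


# Start temperatures 35..41 cover every value of 38 + random.randint(-3, 3).
baseline = {t: run_episode(t, thermostat_action) for t in range(35, 42)}
print(baseline)
```

Under these assumptions every start temperature yields a return of 58 or 60, so a well-trained agent should be able to score far above 0.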
Any help would be greatly appreciated!