machine-learning - Deep reinforcement learning training accuracy

Question

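I am using deep reinforcement learning to predict time-series behavior. I am a beginner, so my question is more conceptual than programming-related. My colleague gave me the chart below, which shows the training, validation, and test accuracy of a time-series classifier trained with deep reinforcement learning.

From this chart, you can see that both the validation and the test accuracy are at chance level, so the agent is clearly overfitting.

What surprises me more (possibly because of my lack of knowledge, which is why I am asking here) is how my colleague trains his agent. The X axis of the chart shows the number of "epochs" (or iterations). In other words, the agent is fitted (trained) many times, as in the code below: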

#initiating the agent

self.agent = DQNAgent(model=self.model, policy=self.policy, 
nb_actions=self.nbActions, memory=self.memory, nb_steps_warmup=200, 
target_model_update=1e-1, 
enable_double_dqn=True,enable_dueling_network=True)

#Compile the agent with the Adam optimizer and with the mean absolute error metric

self.agent.compile(Adam(lr=1e-3), metrics=['mae'])

#there will be 100 iterations, I will fit and test the agent 100 times
for i in range(0,100):
    #delete previous environments and create new ones         
    del(trainEnv)       
    trainEnv = SpEnv(parameters)
    del(validEnv)
    validEnv=SpEnv(parameters)
    del(testEnv)
    testEnv=SpEnv(parameters)

    # Reset the callbacks used to show the metrics while training, validating and testing
    self.trainer.reset()
    self.validator.reset()
    self.tester.reset()

    #### TRAINING STEP ####
    # Reset the training environment
    trainEnv.resetEnv()
    # Train the agent
    self.agent.fit(trainEnv,nb_steps=floor(self.trainSize.days-self.trainSize.days*0.2),visualize=False,verbose=0)
    # Get metrics from the train callback
    (metrics)=self.trainer.getInfo()
    #################################

    #### VALIDATION STEP ####
    # Reset the validation environment
    validEnv.resetEnv()
    # Test the agent on validation data
    self.agent.test(validEnv,other_parameters)
    # Get the info from the validation callback
    (metrics)=self.validator.getInfo()
    ####################################

    #### TEST STEP ####
    # Reset the testing environment
    testEnv.resetEnv()
    # Test the agent on testing data
    self.agent.test(testEnv,nb_episodes=floor(self.validationSize.days-self.validationSize.days*0.2),visualize=False,verbose=0)
    # Get the info from the testing callback
    (metrics)=self.tester.getInfo()

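Given the chart and the code, what strikes me as odd is that the agent is fitted several times, each fit independent of the others, yet the training accuracy increases over time. It looks as if previous experience helps the agent improve its training accuracy. But how is that possible if the environment is reset and the agent is fitted again? Is there some backpropagation of the error from the previous fit that helps the agent improve its accuracy in the next fit?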

score 2 · Accepted Answer


It is the environment that gets reset, not the agent. So the agent actually accumulates experience from every iteration.
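A minimal sketch of this point, using a toy setup rather than keras-rl: `ToyEnv` and `ToyAgent` are hypothetical stand-ins for `SpEnv` and `DQNAgent`, and the table `self.q` plays the role of the network weights. The agent is greedy (no exploration) only to keep the run deterministic. The same loop is run twice, once keeping the agent across iterations and once recreating it.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for SpEnv: action (state % 2) is the rewarded one."""

    def __init__(self, n_states=20):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        reward = 1.0 if action == self.state % 2 else -1.0
        self.state = (self.state + 1) % self.n_states
        return self.state, reward


class ToyAgent:
    """Tabular agent; self.q plays the role of the DQN network weights."""

    def __init__(self, n_states=20, lr=0.1, seed=0):
        rng = random.Random(seed)
        # Random initial values, like randomly initialised network weights.
        self.q = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(n_states)]
        self.lr = lr

    def act(self, state):
        row = self.q[state]
        return row.index(max(row))  # greedy choice, no exploration

    def fit(self, env, nb_steps):
        state = env.reset()
        for _ in range(nb_steps):
            action = self.act(state)
            next_state, reward = env.step(action)
            # Move the estimate toward the observed reward; this mutates self.q.
            self.q[state][action] += self.lr * (reward - self.q[state][action])
            state = next_state

    def accuracy(self, env):
        state = env.reset()
        hits = 0
        for _ in range(env.n_states):
            next_state, reward = env.step(self.act(state))
            hits += reward > 0
            state = next_state
        return hits / env.n_states


def run(iterations, recreate_agent):
    agent = ToyAgent()
    history = []
    for _ in range(iterations):
        env = ToyEnv()  # a fresh environment every iteration, as in the question
        if recreate_agent:
            agent = ToyAgent()  # only this line discards what was learned
        agent.fit(env, nb_steps=env.n_states)
        history.append(agent.accuracy(env))
    return history


persistent = run(30, recreate_agent=False)
independent = run(30, recreate_agent=True)
print("agent kept:     ", persistent)
print("agent recreated:", independent)
```

In this sketch the accuracy of the kept agent can only stay level or rise from one iteration to the next, while the recreated agent reports the same value every time. The loop in the question corresponds to the first case, since `self.agent` is built once before the `for` loop.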

answered 2019-06-04T14:49:17.907

score 2 · Accepted Answer

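The environment is being reset, but the agent is not.

The learnable parameters belong to the agent, not to the environment. So the agent's parameters keep changing across all episodes; that is, every time you fit on the data, the agent is learning.

If the data you fit on is the same every time, this will only make the agent overfit to that data distribution.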
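The overfitting effect can be illustrated with a toy memorizer. This is a sketch under strong assumptions, not the colleague's setup: the data is hypothetical, the labels are pure noise so there is nothing generalizable to learn, and `Memorizer` is an invented class that keeps one pair of scores per training sample.

```python
import random

rng = random.Random(0)

# Hypothetical data: (sample_id, label) pairs whose labels are pure noise.
train = [(i, rng.randrange(2)) for i in range(200)]
valid = [(i, rng.randrange(2)) for i in range(200, 400)]


class Memorizer:
    """Stores scores per sample id; stands in for an over-parameterised agent."""

    def __init__(self):
        self.q = {}

    def predict(self, x):
        scores = self.q.get(x, [0.0, 0.0])
        return scores.index(max(scores))

    def fit(self, data, lr=0.5):
        for x, y in data:
            scores = self.q.setdefault(x, [0.0, 0.0])
            action = self.predict(x)
            reward = 1.0 if action == y else -1.0
            # Reward-driven update that only touches the sample just seen.
            scores[action] += lr * (reward - scores[action])


def accuracy(model, data):
    return sum(model.predict(x) == y for x, y in data) / len(data)


model = Memorizer()
train_before, valid_before = accuracy(model, train), accuracy(model, valid)

for _ in range(3):  # refit on exactly the same training data each time
    model.fit(train)

train_after, valid_after = accuracy(model, train), accuracy(model, valid)
print("train:", train_before, "->", train_after)
print("valid:", valid_before, "->", valid_after)
```

Here the training accuracy climbs to 1.0 because every training sample is memorized, while the validation accuracy does not move at all, which matches the pattern described in the question: rising training accuracy alongside validation and test accuracy that stay at chance level.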
