
I am running hyperparameter optimization with Ray Tune and I keep hitting MemoryError crashes.

Conceptually, the code looks like this. The training function:

    import gc
    import pickle
    from os.path import join

    from ray import tune

    def training_func(config, checkpoint_dir=None):
        args = config["args"]
        train_args = config["train_args"]
        python_class = config["python_class"]  # key must match the config dict passed to tune.run
        model = python_class(**args)
        scores = {}  # ensure tune.report() has something to report on the restore path
        if checkpoint_dir:
            model.load(checkpoint_dir)
        else:
            scores = model.traintest(**train_args)
            with tune.checkpoint_dir(step=0) as checkpoint_dir:
                model.save(checkpoint_dir)
                # save some auxiliary data alongside the checkpoint
                if "some_key" in train_args:
                    with open(join(checkpoint_dir, "data.pkl"), "wb") as f:
                        pickle.dump(train_args["some_key"], f)
        del model
        gc.collect()
        tune.report(**scores)
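To verify that `del model` followed by `gc.collect()` really does free the Python-level allocations (and that the growth must be coming from somewhere else, e.g. Ray's object store), I ran a stdlib `tracemalloc` check along these lines; the list of `bytearray`s is just a stand-in for the model object:

```python
import gc
import tracemalloc

tracemalloc.start()

# Stand-in for the instantiated model: a sizable Python-level allocation.
model = [bytearray(1024) for _ in range(5000)]
with_model, _ = tracemalloc.get_traced_memory()

# Mirror the cleanup done at the end of the trainable.
del model
gc.collect()
without_model, _ = tracemalloc.get_traced_memory()

print(f"traced bytes with model: {with_model}, after del+gc: {without_model}")
assert without_model < with_model  # the objects really were released
tracemalloc.stop()
```

In this isolated check the memory is released as expected, which is why I suspect the retention happens outside my objective function.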

The call to Tune:

    # [.....]
    ray.shutdown()
    ray.init()
    config = {
        "python_class": python_class,
        "train_args": some_dictionary,
        "args": some_other_dictionary,
    }

    analysis = tune.run(
        training_func,
        progress_reporter=tune.CLIReporter(),
        metric="macro_avg_f1-score",
        mode="max",  # tune.run requires a mode alongside metric
        keep_checkpoints_num=1,
        num_samples=1,
        local_dir="/path/to/tune/results/",
        config=config)

When I run this, memory usage grows continuously, even though the instantiated model is released after training inside the objective function. How can I discard a trial's resources once its tuning iteration completes? If that is not possible, how can I make sure that only one Tune trial runs at a time (so that I can manually evict variables I no longer need from memory)?
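For the second question, one workaround I am considering (unverified; assumes a single machine with, say, 8 CPUs) is to make every trial request all available CPUs, so that Tune's resource-based scheduler can never run two trials concurrently:

```python
# Hypothetical: with 8 CPUs in the cluster, a trial that requests all 8
# cannot overlap with any other trial, so trials execute one by one.
analysis = tune.run(
    training_func,
    resources_per_trial={"cpu": 8},
    config=config)
```

I am not sure whether serializing trials this way would actually stop the cumulative growth, or merely slow it down.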

These Stack Overflow questions are similar, and I have already adopted some of their suggestions (e.g., calling the garbage collector explicitly), but the problem is not resolved.

1 2 3 4

