
I have moved from using a single GPU to using multiple GPUs. The code throws this error:

    epoch       main/loss   validation/main/loss  elapsed_time
    Exception in main training loop: '<' not supported between instances of 'list' and 'int'
    Traceback (most recent call last):
      File "/home/ubuntu/anaconda3/envs/chainer_p36/lib/python3.6/site-packages/chainer/training/trainer.py", line 318, in run
        entry.extension(self)
      File "/home/ubuntu/anaconda3/envs/chainer_p36/lib/python3.6/site-packages/chainer/training/extensions/evaluator.py", line 157, in __call__
        result = self.evaluate()
      File "/home/ubuntu/anaconda3/envs/chainer_p36/lib/python3.6/site-packages/chainer/training/extensions/evaluator.py", line 206, in evaluate
        in_arrays = self.converter(batch, self.device)
      File "/home/ubuntu/anaconda3/envs/chainer_p36/lib/python3.6/site-packages/chainer/dataset/convert.py", line 150, in concat_examples
        return to_device(device, _concat_arrays(batch, padding))
      File "/home/ubuntu/anaconda3/envs/chainer_p36/lib/python3.6/site-packages/chainer/dataset/convert.py", line 35, in to_device
        elif device < 0:
    Will finalize trainer extensions and updater before reraising the exception.

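I tried running without a GPU and it works fine. With a single GPU, however, I get an out-of-memory error. So I moved to a p2.8xlarge instance, and now it throws the error above. Where is the problem and how do I fix it?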

Changes made to use 8 GPUs:

     num_gpus = 8
     chainer.cuda.get_device_from_id(0).use()

3. Updater:

    if num_gpus > 0:
        updater = training.updater.ParallelUpdater(
            train_iter,
            optimizer,
            devices={('main' if device == 0 else str(device)): device
                     for device in range(num_gpus)},
        )
    else:
        updater = training.updater.StandardUpdater(train_iter, optimizer,
                                                   device=args.gpus)

4. and so on.. 5. Training:

       trainer.run()

Output -- epoch main/loss validation/main/loss elapsed_time Exception in main training loop: '<' not supported between instances of 'list' and 'int'

I expected the output to be:

          epoch       main/loss   validation/main/loss  elapsed_time
           1.         
           2. 
           3. and so on until it converges.

1 Answer


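`Evaluator` transfers the data to the specified `device`. How do you specify `device` to `Evaluator.__init__`? Note that it should be a single device, not a list of devices. Maybe this example can serve as a reference: https://github.com/chainer/chainer/blob/master/examples/mnist/train_mnist_data_parallel.py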
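A minimal sketch of what this could look like. It is not your actual code, since the `Evaluator` call is not shown in the question. The first part reproduces the same `TypeError` with a plain list, on the assumption that a list of GPU ids ended up as `device`. The rest picks one id out of the `ParallelUpdater` device dict. The helper names `build_devices`, `evaluator_device` and `add_evaluator` are hypothetical, as are `trainer`, `test_iter` and `model`.

```python
def build_devices(num_gpus):
    """Map ParallelUpdater device names to GPU ids; GPU 0 becomes 'main'."""
    return {('main' if gpu_id == 0 else str(gpu_id)): gpu_id
            for gpu_id in range(num_gpus)}


def evaluator_device(devices):
    """Return one device id for the Evaluator (-1 means CPU)."""
    return devices.get('main', -1)


def add_evaluator(trainer, test_iter, model, devices):
    """Attach an Evaluator that runs on a single device (hypothetical helper)."""
    # Imported lazily so the rest of this sketch runs without Chainer installed.
    from chainer.training import extensions
    trainer.extend(
        extensions.Evaluator(test_iter, model,
                             device=evaluator_device(devices)))


# Assumption: a list of GPU ids was passed as `device`. Comparing a list
# with an int fails the same way `elif device < 0:` does in the traceback.
try:
    [0, 1, 2, 3, 4, 5, 6, 7] < 0
except TypeError as exc:
    print(exc)  # '<' not supported between instances of 'list' and 'int'

devices = build_devices(8)
print(evaluator_device(devices))  # 0
```

With the updater from the question you would then call something like `add_evaluator(trainer, test_iter, model, devices)`, so training is spread over all 8 GPUs while evaluation runs only on GPU 0.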

Answered 2019-06-17T03:24:56.987