
I am trying to do distributed training in PyTorch with the DistributedDataParallel strategy on a Databricks notebook (or any notebook environment), but I am running into multiprocessing problems in the Databricks notebook environment.

Problem: I want to use torch.multiprocessing to create multiple processes on a Databricks notebook. I have extracted the issue from my main code to make it easier to understand.

import torch.distributed as dist
import torch.multiprocessing as mp

def train(rank):
  # mp.spawn passes the process index (rank) as the first positional argument
  print("hello")

if __name__ == '__main__':
  processes = 4
  mp.spawn(train, args=(), nprocs=processes)
  print("completed")

Exception:

ProcessExitedException: process 1 terminated with exit code 1
---------------------------------------------------------------------------
ProcessExitedException                    Traceback (most recent call last)
<command-2917251930623656> in <module>
     19 if __name__ == '__main__':
     20   processes = 4
---> 21   mp.spawn(train, args=(), nprocs=processes)
     22   print("completed")
     23 

/databricks/python/lib/python3.8/site-packages/torch/multiprocessing/spawn.py in spawn(fn, args, nprocs, join, daemon, start_method)
    228                ' torch.multiprocessing.start_processes(...)' % start_method)
    229         warnings.warn(msg)
--> 230     return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')

/databricks/python/lib/python3.8/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
    186 
    187     # Loop on join until it returns True or raises an exception.
--> 188     while not context.join():
    189         pass
    190 
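For context, the actual training code follows the standard DistributedDataParallel pattern; a minimal sketch of what I am ultimately trying to run is below (the backend, model, and data here are placeholders, not my real code):

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    # Each spawned process joins the same process group.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = torch.nn.Linear(10, 1)              # placeholder model
    ddp_model = DDP(model)                      # wrap the model for gradient sync
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for _ in range(2):                          # placeholder training loop
        optimizer.zero_grad()
        loss = ddp_model(torch.randn(8, 10)).sum()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4
    mp.spawn(train, args=(world_size,), nprocs=world_size)

Even the stripped-down reproducer above fails in the notebook before any of this DDP setup is reached.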
