I'm trying to run distributed training in PyTorch with the `DistributedDataParallel` strategy on a Databricks notebook (or any notebook environment), but I'm running into multiprocessing problems in the Databricks notebook environment.
Problem: I want to use `torch.multiprocessing` to create multiple processes from a Databricks notebook. I've extracted the issue from my main code into a minimal reproduction to make it easier to follow.
import torch.distributed as dist
import torch.multiprocessing as mp
def train(rank):  # mp.spawn passes the process index as the first argument
    print("hello")
if __name__ == '__main__':
    processes = 4
    mp.spawn(train, args=(), nprocs=processes)
    print("completed")
Exception:
ProcessExitedException: process 1 terminated with exit code 1
---------------------------------------------------------------------------
ProcessExitedException Traceback (most recent call last)
<command-2917251930623656> in <module>
19 if __name__ == '__main__':
20 processes = 4
---> 21 mp.spawn(train, args=(), nprocs=processes)
22 print("completed")
23
/databricks/python/lib/python3.8/site-packages/torch/multiprocessing/spawn.py in spawn(fn, args, nprocs, join, daemon, start_method)
228 ' torch.multiprocessing.start_processes(...)' % start_method)
229 warnings.warn(msg)
--> 230 return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
/databricks/python/lib/python3.8/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
186
187 # Loop on join until it returns True or raises an exception.
--> 188 while not context.join():
189 pass
190
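For context (this is my understanding, not something I've verified on Databricks): the `spawn` start method re-imports the worker function from an importable module in each child process, which can fail for functions defined interactively in a notebook cell, while `fork` copies the notebook interpreter's memory into the children. The traceback itself hints that `torch.multiprocessing.start_processes(...)` accepts a `start_method` argument. A minimal sketch of the same pattern using the stdlib equivalent:

```python
import multiprocessing as mp

def train(rank):
    # Worker body; defined interactively, so a child process can only
    # reach it if the child inherits this interpreter's memory.
    print(f"hello from process {rank}")

if __name__ == "__main__":
    # 'fork' copies the parent process, so notebook-defined functions work;
    # 'spawn' would try to re-import `train` from a module instead.
    ctx = mp.get_context("fork")  # POSIX only; Databricks runs on Linux
    workers = [ctx.Process(target=train, args=(r,)) for r in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print("completed")
```

With `torch.multiprocessing`, the analogous call would be `start_processes(train, args=(), nprocs=4, start_method='fork')` rather than `mp.spawn`, which always uses `spawn`.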