databricks - How do I use the driver node GPU with Horovod on an Azure Databricks cluster?

Question

When I create a cluster with one driver plus two workers, each with one GPU, and try to launch training on every GPU, I write:

from sparkdl import HorovodRunner 
hr = HorovodRunner(np=3) 
hr.run(train_hvd)

But I get the following error message:

HorovodRunner was called with np=3, which is greater than the maximum processes that can be placed
on this cluster. This cluster can place at most 2 processes on 2 executors. Training won't start
until there are enough workers on this cluster. You  can increase the cluster size or cancel the
current run and retry with a smaller np.

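Apparently HorovodRunner does not take the GPU on the driver node into account (is that right?). When I use np=-1 (driver GPU only), np=2 (2 GPUs somewhere), or np=-2 (driver only, but with 2 GPUs), everything works, so there is nothing functionally wrong with my code; I just cannot get it to use all 3 available GPUs.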
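To make the np cases above concrete, here is a minimal sketch. `describe_placement` is a hypothetical helper, not part of sparkdl; it only encodes the behavior observed in this question (a negative np runs on the driver, a positive np is placed on worker executors and is rejected when it exceeds the number of executors). It deliberately leaves np=0 out, and the error text is illustrative rather than the real message.

```python
def describe_placement(np: int, num_executors: int) -> str:
    """Hypothetical helper: summarizes where np processes would run,
    based only on the behavior described in this question."""
    if np < 0:
        # Negative np: -np processes run locally on the driver node.
        return f"{-np} process(es) on the driver node"
    if np == 0:
        # Not covered by this sketch.
        raise ValueError("np=0 is outside the scope of this sketch")
    if np > num_executors:
        # Positive np only counts worker executors, never the driver.
        raise ValueError(
            f"np={np} exceeds the {num_executors} executor(s) available"
        )
    return f"{np} process(es) on worker nodes"


if __name__ == "__main__":
    # Cluster from the question: 1 driver + 2 workers, one GPU each.
    for np in (-1, 2, -2, 3):
        try:
            print(np, "->", describe_placement(np, num_executors=2))
        except ValueError as err:
            print(np, "-> rejected:", err)
```

With two executors, np=3 is the only one of the four values that gets rejected, which matches the error shown above.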

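(a) Is there a way to have Horovod include the GPU on the driver node in distributed training?

(b) Alternatively: is there a way to create a cluster in Databricks with GPU workers but a non-GPU driver?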

score 0 · Accepted Answer

Answered at 2020-01-30T07:03:55.697
