I am trying to run some example Python 3 code from https://docs.databricks.com/applications/deep-learning/distributed-training/horovod-runner.html on a Databricks GPU cluster (1 driver and 2 workers).
Databricks environment:
ML 6.6, Scala 2.11, Spark 2.4.5, GPU
The goal is distributed deep-learning model training.
To start, I tried a very simple example:
from sparkdl import HorovodRunner

hr = HorovodRunner(np=2)

def train():
    print('in train')
    import tensorflow as tf
    print('after import tf')
    hvd.init()
    print('done')

hr.run(train)
However, the command keeps running without making any progress. The only cell output is:
HorovodRunner will stream all training logs to notebook cell output. If there are too many logs, you can adjust the log level in your train method. Or you can set driver_log_verbosity to 'log_callback_only' and use a HorovodRunner log callback on the first worker to get concise progress updates.
The global names read or written to by the pickled function are {'print', 'hvd'}.
The pickled object size is 1444 bytes.

### How to enable Horovod Timeline? ###
HorovodRunner has the ability to record the timeline of its activity with Horovod Timeline. To record a Horovod Timeline, set the `HOROVOD_TIMELINE` environment variable to the location of the timeline file to be created. You can then open the timeline file using the chrome://tracing facility of the Chrome browser.
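For what it's worth, here is how I read the two suggestions in that output. The DBFS path is only a placeholder, and I have not verified that the HorovodRunner constructor in ML 6.6 actually accepts the driver_log_verbosity keyword, so this is just a sketch of what I would try, not something I know works:

import os
from sparkdl import HorovodRunner

def train():
    # Placeholder location on DBFS for the Horovod Timeline file,
    # set before hvd.init() as the log output suggests.
    os.environ['HOROVOD_TIMELINE'] = '/dbfs/tmp/horovod_timeline.json'

    import tensorflow as tf
    import horovod.tensorflow as hvd  # imported inside train() instead of relying on the notebook-scope hvd
    hvd.init()
    print('rank:', hvd.rank())

# 'log_callback_only' is the value mentioned in the log output above.
hr = HorovodRunner(np=2, driver_log_verbosity='log_callback_only')
hr.run(train)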
Am I missing something, or is there something I need to configure to make this work?
Thanks.