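I am trying to create a new environment based on the TF 2.4 curated environment, with OpenCV added. OpenCV support is the only difference. I modified the Dockerfile to include opencv-python as shown below: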

    FROM mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.0.3-cudnn8-ubuntu18.04:20211005.v1

    ENV AZUREML_CONDA_ENVIRONMENT_PATH /azureml-envs/tensorflow-2.4

    # Create conda environment
    RUN conda create -p $AZUREML_CONDA_ENVIRONMENT_PATH \
        python=3.7 pip=20.2.4

    # Prepend path to AzureML conda environment
    ENV PATH $AZUREML_CONDA_ENVIRONMENT_PATH/bin:$PATH

    # Install pip dependencies
    RUN HOROVOD_WITH_TENSORFLOW=1 \
        pip install 'matplotlib>=3.3,<3.4' \
                    'psutil>=5.8,<5.9' \
                    'tqdm>=4.59,<4.60' \
                    'pandas>=1.1,<1.2' \
                    'scipy>=1.5,<1.6' \
                    'numpy>=1.10,<1.20' \
                    'ipykernel~=6.0' \
                    'azureml-core==1.34.0' \
                    'azureml-defaults==1.34.0' \
                    'azureml-mlflow==1.34.0' \
                    'azureml-telemetry==1.34.0' \
                    'tensorboard==2.4.0' \
                    'tensorflow-gpu==2.4.1' \
                    'tensorflow-datasets==4.3.0' \
                    'onnxruntime-gpu>=1.7,<1.8' \
                    'horovod[tensorflow-gpu]==0.21.3' \
                    'opencv-python'

    # This is needed for mpi to locate libpython
    ENV LD_LIBRARY_PATH $AZUREML_CONDA_ENVIRONMENT_PATH/lib:$LD_LIBRARY_PATH

However, Horovod fails to build against TensorFlow and shows the following error message:

 ERROR: Command errored out with exit status 1:
   command: /azureml-envs/tensorflow-2.4/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-pjyu9d6m/horovod/setup.py'"'"'; __file__='"'"'/tmp/pip-install-pjyu9d6m/horovod/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-0t6zraqk
       cwd: /tmp/pip-install-pjyu9d6m/horovod/
  Complete output (233 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-x86_64-3.7
  creating build/lib.linux-x86_64-3.7/horovod
  copying horovod/__init__.py -> build/lib.linux-x86_64-3.7/horovod
  creating build/lib.linux-x86_64-3.7/horovod/runner
  copying horovod/runner/task_fn.py -> build/lib.linux-x86_64-3.7/horovod/runner
  copying horovod/runner/__init__.py -> build/lib.linux-x86_64-3.7/horovod/runner
  copying horovod/runner/launch.py -> build/lib.linux-x86_64-3.7/horovod/runner
  copying horovod/runner/js_run.py -> build/lib.linux-x86_64-3.7/horovod/runner
  copying horovod/runner/gloo_run.py -> build/lib.linux-x86_64-3.7/horovod/runner
  copying horovod/runner/run_task.py -> build/lib.linux-x86_64-3.7/horovod/runner
  copying horovod/runner/mpi_run.py -> build/lib.linux-x86_64-3.7/horovod/runner
  creating build/lib.linux-x86_64-3.7/horovod/_keras
  copying horovod/_keras/__init__.py -> build/lib.linux-x86_64-3.7/horovod/_keras
  copying horovod/_keras/callbacks.py -> build/lib.linux-x86_64-3.7/horovod/_keras
  copying horovod/_keras/elastic.py -> build/lib.linux-x86_64-3.7/horovod/_keras
  creating build/lib.linux-x86_64-3.7/horovod/torch
  copying horovod/torch/sync_batch_norm.py -> build/lib.linux-x86_64-3.7/horovod/torch
  copying horovod/torch/__init__.py -> build/lib.linux-x86_64-3.7/horovod/torch
  copying horovod/torch/mpi_ops.py -> build/lib.linux-x86_64-3.7/horovod/torch
  copying horovod/torch/optimizer.py -> build/lib.linux-x86_64-3.7/horovod/torch
  copying horovod/torch/functions.py -> build/lib.linux-x86_64-3.7/horovod/torch
  copying horovod/torch/compression.py -> build/lib.linux-x86_64-3.7/horovod/torch
  creating build/lib.linux-x86_64-3.7/horovod/keras
  copying horovod/keras/__init__.py -> build/lib.linux-x86_64-3.7/horovod/keras
  copying horovod/keras/callbacks.py -> build/lib.linux-x86_64-3.7/horovod/keras
  copying horovod/keras/elastic.py -> build/lib.linux-x86_64-3.7/horovod/keras
  creating build/lib.linux-x86_64-3.7/horovod/tensorflow
  copying horovod/tensorflow/sync_batch_norm.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
  copying horovod/tensorflow/__init__.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
  copying horovod/tensorflow/elastic.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
  copying horovod/tensorflow/mpi_ops.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
  copying horovod/tensorflow/util.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
  copying horovod/tensorflow/gradient_aggregation_eager.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
  copying horovod/tensorflow/gradient_aggregation.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
  copying horovod/tensorflow/functions.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
  copying horovod/tensorflow/compression.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
  creating build/lib.linux-x86_64-3.7/horovod/spark
  copying horovod/spark/runner.py -> build/lib.linux-x86_64-3.7/horovod/spark
  copying horovod/spark/__init__.py -> build/lib.linux-x86_64-3.7/horovod/spark
  copying horovod/spark/conf.py -> build/lib.linux-x86_64-3.7/horovod/spark
  copying horovod/spark/gloo_run.py -> build/lib.linux-x86_64-3.7/horovod/spark
  copying horovod/spark/mpi_run.py -> build/lib.linux-x86_64-3.7/horovod/spark
  creating build/lib.linux-x86_64-3.7/horovod/common
  copying horovod/common/__init__.py -> build/lib.linux-x86_64-3.7/horovod/common
  copying horovod/common/exceptions.py -> build/lib.linux-x86_64-3.7/horovod/common
  copying horovod/common/elastic.py -> build/lib.linux-x86_64-3.7/horovod/common
  copying horovod/common/util.py -> build/lib.linux-x86_64-3.7/horovod/common
  copying horovod/common/basics.py -> build/lib.linux-x86_64-3.7/horovod/common
  creating build/lib.linux-x86_64-3.7/horovod/mxnet
  copying horovod/mxnet/__init__.py -> build/lib.linux-x86_64-3.7/horovod/mxnet
  copying horovod/mxnet/mpi_ops.py -> build/lib.linux-x86_64-3.7/horovod/mxnet
  copying horovod/mxnet/functions.py -> build/lib.linux-x86_64-3.7/horovod/mxnet
  creating build/lib.linux-x86_64-3.7/horovod/ray
  copying horovod/ray/runner.py -> build/lib.linux-x86_64-3.7/horovod/ray
  copying horovod/ray/__init__.py -> build/lib.linux-x86_64-3.7/horovod/ray
  copying horovod/ray/ray_logger.py -> build/lib.linux-x86_64-3.7/horovod/ray
  copying horovod/ray/elastic.py -> build/lib.linux-x86_64-3.7/horovod/ray
  copying horovod/ray/utils.py -> build/lib.linux-x86_64-3.7/horovod/ray
  copying horovod/ray/driver_service.py -> build/lib.linux-x86_64-3.7/horovod/ray
  creating build/lib.linux-x86_64-3.7/horovod/runner/util
  copying horovod/runner/util/lsf.py -> build/lib.linux-x86_64-3.7/horovod/runner/util
  copying horovod/runner/util/streams.py -> build/lib.linux-x86_64-3.7/horovod/runner/util
  copying horovod/runner/util/threads.py -> build/lib.linux-x86_64-3.7/horovod/runner/util
  copying horovod/runner/util/__init__.py -> build/lib.linux-x86_64-3.7/horovod/runner/util
  copying horovod/runner/util/remote.py -> build/lib.linux-x86_64-3.7/horovod/runner/util
  copying horovod/runner/util/network.py -> build/lib.linux-x86_64-3.7/horovod/runner/util
  copying horovod/runner/util/cache.py -> build/lib.linux-x86_64-3.7/horovod/runner/util
  creating build/lib.linux-x86_64-3.7/horovod/runner/http
  copying horovod/runner/http/__init__.py -> build/lib.linux-x86_64-3.7/horovod/runner/http
  copying horovod/runner/http/http_client.py -> build/lib.linux-x86_64-3.7/horovod/runner/http
  copying horovod/runner/http/http_server.py -> build/lib.linux-x86_64-3.7/horovod/runner/http
  creating build/lib.linux-x86_64-3.7/horovod/runner/common
  copying horovod/runner/common/__init__.py -> build/lib.linux-x86_64-3.7/horovod/runner/common
  creating build/lib.linux-x86_64-3.7/horovod/runner/task
  copying horovod/runner/task/__init__.py -> build/lib.linux-x86_64-3.7/horovod/runner/task
  copying horovod/runner/task/task_service.py -> build/lib.linux-x86_64-3.7/horovod/runner/task
  creating build/lib.linux-x86_64-3.7/horovod/runner/driver
  copying horovod/runner/driver/__init__.py -> build/lib.linux-x86_64-3.7/horovod/runner/driver
  copying horovod/runner/driver/driver_service.py -> build/lib.linux-x86_64-3.7/horovod/runner/driver
  creating build/lib.linux-x86_64-3.7/horovod/runner/elastic
  copying horovod/runner/elastic/worker.py -> build/lib.linux-x86_64-3.7/horovod/runner/elastic
  copying horovod/runner/elastic/__init__.py -> build/lib.linux-x86_64-3.7/horovod/runner/elastic
  copying horovod/runner/elastic/driver.py -> build/lib.linux-x86_64-3.7/horovod/runner/elastic
  copying horovod/runner/elastic/registration.py -> build/lib.linux-x86_64-3.7/horovod/runner/elastic
  copying horovod/runner/elastic/rendezvous.py -> build/lib.linux-x86_64-3.7/horovod/runner/elastic
  copying horovod/runner/elastic/constants.py -> build/lib.linux-x86_64-3.7/horovod/runner/elastic
  copying horovod/runner/elastic/discovery.py -> build/lib.linux-x86_64-3.7/horovod/runner/elastic
  copying horovod/runner/elastic/settings.py -> build/lib.linux-x86_64-3.7/horovod/runner/elastic
  creating build/lib.linux-x86_64-3.7/horovod/runner/common/util
  copying horovod/runner/common/util/host_hash.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/util
  copying horovod/runner/common/util/__init__.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/util
  copying horovod/runner/common/util/config_parser.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/util
  copying horovod/runner/common/util/timeout.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/util
  copying horovod/runner/common/util/secret.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/util
  copying horovod/runner/common/util/tiny_shell_exec.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/util
  copying horovod/runner/common/util/env.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/util
  copying horovod/runner/common/util/codec.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/util
  copying horovod/runner/common/util/network.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/util
  copying horovod/runner/common/util/settings.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/util
  copying horovod/runner/common/util/safe_shell_exec.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/util
  copying horovod/runner/common/util/hosts.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/util
  creating build/lib.linux-x86_64-3.7/horovod/runner/common/service
  copying horovod/runner/common/service/__init__.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/service
  copying horovod/runner/common/service/driver_service.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/service
  copying horovod/runner/common/service/task_service.py -> build/lib.linux-x86_64-3.7/horovod/runner/common/service
  creating build/lib.linux-x86_64-3.7/horovod/torch/mpi_lib_impl
  copying horovod/torch/mpi_lib_impl/__init__.py -> build/lib.linux-x86_64-3.7/horovod/torch/mpi_lib_impl
  creating build/lib.linux-x86_64-3.7/horovod/torch/mpi_lib
  copying horovod/torch/mpi_lib/__init__.py -> build/lib.linux-x86_64-3.7/horovod/torch/mpi_lib
  creating build/lib.linux-x86_64-3.7/horovod/torch/elastic
  copying horovod/torch/elastic/__init__.py -> build/lib.linux-x86_64-3.7/horovod/torch/elastic
  copying horovod/torch/elastic/state.py -> build/lib.linux-x86_64-3.7/horovod/torch/elastic
  copying horovod/torch/elastic/sampler.py -> build/lib.linux-x86_64-3.7/horovod/torch/elastic
  creating build/lib.linux-x86_64-3.7/horovod/tensorflow/keras
  copying horovod/tensorflow/keras/__init__.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow/keras
  copying horovod/tensorflow/keras/callbacks.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow/keras
  copying horovod/tensorflow/keras/elastic.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow/keras
  creating build/lib.linux-x86_64-3.7/horovod/spark/torch
  copying horovod/spark/torch/__init__.py -> build/lib.linux-x86_64-3.7/horovod/spark/torch
  copying horovod/spark/torch/remote.py -> build/lib.linux-x86_64-3.7/horovod/spark/torch
  copying horovod/spark/torch/estimator.py -> build/lib.linux-x86_64-3.7/horovod/spark/torch
  copying horovod/spark/torch/util.py -> build/lib.linux-x86_64-3.7/horovod/spark/torch
  creating build/lib.linux-x86_64-3.7/horovod/spark/keras
  copying horovod/spark/keras/__init__.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
  copying horovod/spark/keras/remote.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
  copying horovod/spark/keras/optimizer.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
  copying horovod/spark/keras/estimator.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
  copying horovod/spark/keras/tensorflow.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
  copying horovod/spark/keras/util.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
  copying horovod/spark/keras/bare.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
  creating build/lib.linux-x86_64-3.7/horovod/spark/common
  copying horovod/spark/common/store.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
  copying horovod/spark/common/_namedtuple_fix.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
  copying horovod/spark/common/serialization.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
  copying horovod/spark/common/__init__.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
  copying horovod/spark/common/params.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
  copying horovod/spark/common/estimator.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
  copying horovod/spark/common/util.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
  copying horovod/spark/common/backend.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
  copying horovod/spark/common/constants.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
  copying horovod/spark/common/cache.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
  creating build/lib.linux-x86_64-3.7/horovod/spark/task
  copying horovod/spark/task/__init__.py -> build/lib.linux-x86_64-3.7/horovod/spark/task
  copying horovod/spark/task/task_info.py -> build/lib.linux-x86_64-3.7/horovod/spark/task
  copying horovod/spark/task/mpirun_exec_fn.py -> build/lib.linux-x86_64-3.7/horovod/spark/task
  copying horovod/spark/task/gloo_exec_fn.py -> build/lib.linux-x86_64-3.7/horovod/spark/task
  copying horovod/spark/task/task_service.py -> build/lib.linux-x86_64-3.7/horovod/spark/task
  creating build/lib.linux-x86_64-3.7/horovod/spark/driver
  copying horovod/spark/driver/job_id.py -> build/lib.linux-x86_64-3.7/horovod/spark/driver
  copying horovod/spark/driver/__init__.py -> build/lib.linux-x86_64-3.7/horovod/spark/driver
  copying horovod/spark/driver/driver_service.py -> build/lib.linux-x86_64-3.7/horovod/spark/driver
  copying horovod/spark/driver/host_discovery.py -> build/lib.linux-x86_64-3.7/horovod/spark/driver
  copying horovod/spark/driver/rendezvous.py -> build/lib.linux-x86_64-3.7/horovod/spark/driver
  copying horovod/spark/driver/rsh.py -> build/lib.linux-x86_64-3.7/horovod/spark/driver
  copying horovod/spark/driver/mpirun_rsh.py -> build/lib.linux-x86_64-3.7/horovod/spark/driver
  running build_ext
  -- Could not find CCache. Consider installing CCache to speed up compilation.
  -- The CXX compiler identification is GNU 7.5.0
  -- Check for working CXX compiler: /usr/bin/c++
  -- Check for working CXX compiler: /usr/bin/c++ -- works
  -- Detecting CXX compiler ABI info
  -- Detecting CXX compiler ABI info - done
  -- Detecting CXX compile features
  -- Detecting CXX compile features - done
  -- Build architecture flags: -mf16c -mavx -mfma
  -- Using command /azureml-envs/tensorflow-2.4/bin/python
  -- Found MPI_CXX: /usr/local/lib/libmpi.so (found version "3.1")
  -- Found MPI: TRUE (found version "3.1")
  -- Found CUDA: /usr/local/cuda (found version "11.0")
  -- Linking against static NCCL library
  -- Found NCCL: /usr/include
  -- Determining NCCL version from the header file: /usr/include/nccl.h
  -- NCCL_MAJOR_VERSION: 2
  -- Found NCCL (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libnccl_static.a)
  -- The C compiler identification is GNU 7.5.0
  -- Check for working C compiler: /usr/bin/cc
  -- Check for working C compiler: /usr/bin/cc -- works
  -- Detecting C compiler ABI info
  -- Detecting C compiler ABI info - done
  -- Detecting C compile features
  -- Detecting C compile features - done
  -- Found MPI_C: /usr/local/lib/libmpi.so (found version "3.1")
  -- Found MPI: TRUE (found version "3.1")
  -- MPI include path: /usr/local/include
  -- MPI libraries: /usr/local/lib/libmpi.so
  CMake Error at /usr/share/cmake-3.10/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
    Could NOT find Tensorflow (missing: Tensorflow_LIBRARIES) (Required is at
    least version "1.15.0")
  Call Stack (most recent call first):
    /usr/share/cmake-3.10/Modules/FindPackageHandleStandardArgs.cmake:378 (_FPHSA_FAILURE_MESSAGE)
    cmake/Modules/FindTensorflow.cmake:31 (find_package_handle_standard_args)
    horovod/tensorflow/CMakeLists.txt:12 (find_package)
  
  
  -- Configuring incomplete, errors occurred!
  See also "/tmp/pip-install-pjyu9d6m/horovod/build/temp.linux-x86_64-3.7/CMakeFiles/CMakeOutput.log".
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/pip-install-pjyu9d6m/horovod/setup.py", line 188, in <module>
      'horovodrun = horovod.runner.launch:run_commandline'
    File "/azureml-envs/tensorflow-2.4/lib/python3.7/site-packages/setuptools/__init__.py", line 153, in setup
      return distutils.core.setup(**attrs)
    File "/azureml-envs/tensorflow-2.4/lib/python3.7/distutils/core.py", line 148, in setup
      dist.run_commands()
    File "/azureml-envs/tensorflow-2.4/lib/python3.7/distutils/dist.py", line 966, in run_commands
      self.run_command(cmd)
    File "/azureml-envs/tensorflow-2.4/lib/python3.7/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/azureml-envs/tensorflow-2.4/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 299, in run
      self.run_command('build')
    File "/azureml-envs/tensorflow-2.4/lib/python3.7/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/azureml-envs/tensorflow-2.4/lib/python3.7/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/azureml-envs/tensorflow-2.4/lib/python3.7/distutils/command/build.py", line 135, in run
      self.run_command(cmd_name)
    File "/azureml-envs/tensorflow-2.4/lib/python3.7/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/azureml-envs/tensorflow-2.4/lib/python3.7/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/azureml-envs/tensorflow-2.4/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 79, in run
      _build_ext.run(self)
    File "/azureml-envs/tensorflow-2.4/lib/python3.7/distutils/command/build_ext.py", line 340, in run
      self.build_extensions()
    File "/tmp/pip-install-pjyu9d6m/horovod/setup.py", line 89, in build_extensions
      cwd=self.build_temp)
    File "/azureml-envs/tensorflow-2.4/lib/python3.7/subprocess.py", line 363, in check_call
      raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '['cmake', '/tmp/pip-install-pjyu9d6m/horovod', '-DCMAKE_BUILD_TYPE=RelWithDebInfo', '-DCMAKE_LIBRARY_OUTPUT_DIRECTORY_RELWITHDEBINFO=/tmp/pip-install-pjyu9d6m/horovod/build/lib.linux-x86_64-3.7', '-DPYTHON_EXECUTABLE:FILEPATH=/azureml-envs/tensorflow-2.4/bin/python']' returned non-zero exit status 1.
  ----------------------------------------
  ERROR: Failed building wheel for horovod

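I am new to Azure ML and I find the documentation a bit unclear. I also tried adding opencv-python to the existing curated environment by calling conda_dep.add_pip_package("opencv-python"). The result was the same.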

1 Answer

Some curated images are provided for compute clusters. The Dockerfiles listed at the following link can be customized for your own workflow: https://docs.microsoft.com/en-us/azure/machine-learning/resource-curated-environments#tensorflow

Here is a link to the distributed GPU training guide.
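
One possible customization, offered as a sketch rather than a confirmed fix: the CMake error in the question (`Could NOT find Tensorflow (missing: Tensorflow_LIBRARIES)`) is typically what Horovod's build reports when TensorFlow cannot be imported at the moment the Horovod wheel is built. Assuming that is the cause here, installing TensorFlow in its own `RUN` step and Horovod in a later one may help. The package pins are copied from the question; the two-step split, the `--no-cache-dir` flag, and the final build check are assumptions and not part of the curated image.

```dockerfile
FROM mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.0.3-cudnn8-ubuntu18.04:20211005.v1

ENV AZUREML_CONDA_ENVIRONMENT_PATH /azureml-envs/tensorflow-2.4

# Create conda environment
RUN conda create -p $AZUREML_CONDA_ENVIRONMENT_PATH \
    python=3.7 pip=20.2.4

# Prepend path to AzureML conda environment
ENV PATH $AZUREML_CONDA_ENVIRONMENT_PATH/bin:$PATH

# Step 1 (assumption): install TensorFlow, OpenCV, and everything except
# Horovod, so that TensorFlow is importable before Horovod is compiled.
RUN pip install 'matplotlib>=3.3,<3.4' \
                'psutil>=5.8,<5.9' \
                'tqdm>=4.59,<4.60' \
                'pandas>=1.1,<1.2' \
                'scipy>=1.5,<1.6' \
                'numpy>=1.10,<1.20' \
                'ipykernel~=6.0' \
                'azureml-core==1.34.0' \
                'azureml-defaults==1.34.0' \
                'azureml-mlflow==1.34.0' \
                'azureml-telemetry==1.34.0' \
                'tensorboard==2.4.0' \
                'tensorflow-gpu==2.4.1' \
                'tensorflow-datasets==4.3.0' \
                'onnxruntime-gpu>=1.7,<1.8' \
                'opencv-python'

# Step 2 (assumption): build Horovod in a separate layer, after TensorFlow.
# --no-cache-dir avoids reusing a wheel built earlier without TensorFlow.
RUN HOROVOD_WITH_TENSORFLOW=1 \
    pip install --no-cache-dir 'horovod[tensorflow-gpu]==0.21.3'

# Optional check (assumption): list the frameworks Horovod was built with.
RUN horovodrun --check-build

# This is needed for mpi to locate libpython
ENV LD_LIBRARY_PATH $AZUREML_CONDA_ENVIRONMENT_PATH/lib:$LD_LIBRARY_PATH
```

Running `docker build` on this file locally shows whether the Horovod step succeeds before the image is registered as an Azure ML environment.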

Answered 2021-10-25T04:25:50.017