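I am trying to deploy an AML service with AKS by following the repo https://github.com/microsoft/MLAKSDeployAML/.
I set this up on an NC6_v2 DSVM machine. After struggling to get conda working, I finally got my environment set up and started running the notebooks.
I submitted the experiment and then waited on run.wait_for_completion(show_output=True), which failed with an HTTP error. The full control log is attached below.
Could this be related to the machine being a GPU machine, or is there some other problem with the service?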
Streaming log file azureml-logs/60_control_log.txt
Starting the daemon thread to refresh tokens in background for process with pid = 13317
nvidia-docker is installed on the target. Using nvidia-docker for docker operations.
Running: ['/bin/bash', '/tmp/azureml_runs/mlaks-train-on-local_1569245453_408a217b/azureml-environment-setup/docker_env_checker.sh']
Materialized image not found on target: azureml/azureml_473a6fe028e178fff5c9a8d49bc938f3
Logging experiment preparation status in history service.
Running: ['/bin/bash', '/tmp/azureml_runs/mlaks-train-on-local_1569245453_408a217b/azureml-environment-setup/docker_env_builder.sh']
Running: ['nvidia-docker', 'build', '-f', 'azureml-environment-setup/Dockerfile', '-t', 'azureml/azureml_473a6fe028e178fff5c9a8d49bc938f3', '.']
Sending build context to Docker daemon 410.1kB
Step 1/15 : FROM continuumio/miniconda3@sha256:54eb3dd4003f11f6a651b55fc2074a0ed6d9eeaa642f1c4c9a7cf8b148a30ceb
---> 4a51de2367be
Step 2/15 : USER root
---> Using cache
---> 42491a367cef
Step 3/15 : RUN mkdir -p $HOME/.cache
---> Using cache
---> 0771da9ffb76
Step 4/15 : WORKDIR /
---> Using cache
---> a8db57273ffb
Step 5/15 : COPY azureml-environment-setup/99brokenproxy /etc/apt/apt.conf.d/
---> Using cache
---> b2a669b740ca
Step 6/15 : RUN if dpkg --compare-versions `conda --version | grep -oE '[^ ]+$'` lt 4.4.11; then conda install conda==4.4.11; fi
---> Using cache
---> 1e430aeb68b0
Step 7/15 : COPY azureml-environment-setup/mutated_conda_dependencies.yml azureml-environment-setup/mutated_conda_dependencies.yml
---> Using cache
---> 0c6a9fafa84b
Step 8/15 : RUN ldconfig /usr/local/cuda/lib64/stubs && conda env create -p /azureml-envs/azureml_6303d702d8163bbfc0017533e979d4a3 -f azureml-environment-setup/mutated_conda_dependencies.yml && rm -rf "$HOME/.cache/pip" && conda clean -aqy && CONDA_ROOT_DIR=$(conda info --root) && rm -rf "$CONDA_ROOT_DIR/pkgs" && find "$CONDA_ROOT_DIR" -type d -name __pycache__ -exec rm -rf {} + && ldconfig
---> Running in a579672607b3
Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies. Conda may not use the correct pip to install your packages, and they may end up in the wrong place. Please add an explicit pip dependency. I'm adding one for you, but still nagging you.
Collecting package metadata (repodata.json): ...working... failed
CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/linux-64/repodata.json>
Elapsed: -
An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.
ConnectionError(MaxRetryError("HTTPSConnectionPool(host='conda.anaconda.org', port=443): Max retries exceeded with url: /conda-forge/linux-64/repodata.json (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fbb8c38cda0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))"))
The command '/bin/sh -c ldconfig /usr/local/cuda/lib64/stubs && conda env create -p /azureml-envs/azureml_6303d702d8163bbfc0017533e979d4a3 -f azureml-environment-setup/mutated_conda_dependencies.yml && rm -rf "$HOME/.cache/pip" && conda clean -aqy && CONDA_ROOT_DIR=$(conda info --root) && rm -rf "$CONDA_ROOT_DIR/pkgs" && find "$CONDA_ROOT_DIR" -type d -name __pycache__ -exec rm -rf {} + && ldconfig' returned a non-zero code: 1
CalledProcessError(1, ['nvidia-docker', 'build', '-f', 'azureml-environment-setup/Dockerfile', '-t', 'azureml/azureml_473a6fe028e178fff5c9a8d49bc938f3', '.'])
Building docker image failed with exit code: 1
Logging error in history service: Failed to run ['/bin/bash', '/tmp/azureml_runs/mlaks-train-on-local_1569245453_408a217b/azureml-environment-setup/docker_env_builder.sh']
Exit code 1
Details can be found in azureml-logs/60_control_log.txt log file.
Uploading control log...
Sending final run history status...
Logging experiment failed status in history service.
Control script execution completed
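
The last error in the log above is "Temporary failure in name resolution" for conda.anaconda.org, which suggests a DNS problem during the Docker build rather than anything GPU-related. Below is a minimal sketch of a check for whether that host name resolves, using only Python's standard socket module. The helper name `can_resolve` and its `resolver` parameter are hypothetical, introduced here for illustration. They are not part of the AML SDK or the repo.

```python
import socket


def can_resolve(host, port=443, resolver=socket.getaddrinfo):
    """Return True if `host` resolves to at least one address.

    `resolver` defaults to socket.getaddrinfo; it is a parameter only so
    the check can be exercised without network access (an assumption made
    for this sketch).
    """
    try:
        # getaddrinfo returns a list of address tuples on success
        return len(resolver(host, port)) > 0
    except socket.gaierror:
        # Raised for failures such as "[Errno -3] Temporary failure in
        # name resolution", the error shown in the control log
        return False
```

Calling `can_resolve("conda.anaconda.org")` on the DSVM itself and then again from a Python session inside a container would show whether name resolution fails only inside Docker or on the host as well.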