I am trying to launch a training job on Google AI Platform using a custom container. Because I want to train on GPUs, the base image I use for my container is:

FROM nvidia/cuda:11.1.1-cudnn8-runtime-ubuntu18.04

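With this image (and TensorFlow 2.4.1 installed on top of it), I thought I could use GPUs on AI Platform, but that does not seem to be the case. When training starts, the logs show the following: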

W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (gke-cml-0309-144111--n1-highmem-8-43e-0b9fbbdc-gnq6): /proc/driver/nvidia/version does not exist
I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
WARNING:tensorflow:There are non-GPU devices in `tf.distribute.Strategy`, not using nccl allreduce.

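Is this a good way to build an image for using GPUs on Google AI Platform? Or should I instead start from a TensorFlow image and manually install all the drivers needed to use the GPUs?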

EDIT: I read the following here (https://cloud.google.com/ai-platform/training/docs/containers-overview):

For training with GPUs, your custom container needs to meet a few
special requirements. You must build a different Docker image than     
what you'd use for training with CPUs.

Pre-install the CUDA toolkit and cuDNN in your Docker image. Using the 
nvidia/cuda image as your base image is the recommended way to handle 
this. It has the matching versions of CUDA toolkit and cuDNN pre-
installed, and it helps you set up the related environment variables 
correctly.

Install your training application, along with your required ML     
framework and other dependencies in your Docker image.

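They also provide an example Dockerfile there for training with GPUs, so what I am doing seems to be fine. Unfortunately, I still get the errors mentioned above, which may (or may not) explain why I cannot use GPUs on Google AI Platform.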

EDIT2: Following what I read here (https://www.tensorflow.org/install/gpu), my Dockerfile is now:

FROM tensorflow/tensorflow:2.4.1-gpu
RUN apt-get update && apt-get install -y \
 lsb-release \
 vim \
 curl \
 git \
 libgl1-mesa-dev \
 software-properties-common \
 wget && \
 rm -rf /var/lib/apt/lists/*

# Add NVIDIA package repositories
RUN wget -nv https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
RUN mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
RUN add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
RUN apt-get update

RUN wget -nv http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb

RUN apt install ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
RUN apt-get update

# Install NVIDIA driver
RUN apt-get install -y --no-install-recommends nvidia-driver-450
# Reboot. Check that GPUs are visible using the command: nvidia-smi

RUN wget -nv https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
RUN apt install ./libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
RUN apt-get update

# Install development and runtime libraries (~4GB)
RUN apt-get install --no-install-recommends \
    cuda-11-0 \
    libcudnn8=8.0.4.30-1+cuda11.0  \
    libcudnn8-dev=8.0.4.30-1+cuda11.0


# other stuff

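The problem is that the build freezes at what appears to be a keyboard configuration step. The system asks me to select a country, and when I type a number, nothing happens.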
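
A freeze like this is commonly caused by `apt-get` opening an interactive debconf prompt during `docker build`, where there is no terminal to answer it. The following is a minimal sketch under that assumption (it has not been verified against this exact image): it sets `DEBIAN_FRONTEND=noninteractive` for the build and passes `-y` to the install command. The package list is trimmed for illustration.

```dockerfile
FROM tensorflow/tensorflow:2.4.1-gpu

# Assumption: the freeze is a debconf prompt (e.g. keyboard layout) waiting for input.
# Using ARG instead of ENV keeps the setting limited to build time.
ARG DEBIAN_FRONTEND=noninteractive

# -y answers apt's own confirmation prompt; noninteractive suppresses debconf dialogs.
RUN apt-get update && apt-get install -y --no-install-recommends \
        curl \
        git \
        wget && \
    rm -rf /var/lib/apt/lists/*
```

The same `ARG` line would apply to the later `apt-get install` steps as well, since build arguments are visible to every subsequent `RUN` instruction in the stage.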

1 Answer

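The recommended way to build the most reliable container is to use the officially maintained "Deep Learning Containers". I suggest pulling "gcr.io/deeplearning-platform-release/tf2-gpu.2-4". It should already have CUDA, cuDNN, the GPU drivers, and TF 2.4 installed and tested. You only need to add your code to it.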
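
As a rough illustration of "adding your code" to that image, a Dockerfile could look like the sketch below. The `trainer/` directory and the `trainer.task` module are hypothetical names chosen for the example, not something taken from the question.

```dockerfile
# Base image named above; CUDA, cuDNN and TF 2.4 are expected to be preinstalled.
FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-4

# Hypothetical layout: training code lives in ./trainer with a task.py entry module.
WORKDIR /root
COPY trainer/ /root/trainer/

# Run the (hypothetical) trainer.task module when the container starts.
ENTRYPOINT ["python", "-m", "trainer.task"]
```

Any extra Python dependencies would go in a `RUN pip install ...` step between `COPY` and `ENTRYPOINT`.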

Answered 2021-03-11T01:05:47.320