
I am trying to replicate some work/experiments that require me to follow this particular tutorial on setting up Jupyter + TensorFlow + Nvidia GPU + Docker + Google Compute Engine.

I was able to install nvidia-docker successfully. However, in the tutorial's section "Verify the GPU is Visible from a Docker Container", when I try to run

sudo nvidia-docker-plugin

I get the following error (see the last line):

nvidia-docker-plugin | 2019/04/23 15:17:47 Loading NVIDIA unified memory
nvidia-docker-plugin | 2019/04/23 15:17:47 Loading NVIDIA management library
nvidia-docker-plugin | 2019/04/23 15:17:47 Discovering GPU devices
nvidia-docker-plugin | 2019/04/23 15:17:47 Provisioning volumes at /var/lib/nvidia-docker/volumes
nvidia-docker-plugin | 2019/04/23 15:17:47 Serving plugin API at /run/docker/plugins
nvidia-docker-plugin | 2019/04/23 15:17:47 Serving remote API at localhost:3476
nvidia-docker-plugin | 2019/04/23 15:17:47 Error: listen tcp 127.0.0.1:3476: bind: address already in use

When I run

sudo nvidia-docker run --rm nvidia/cuda nvidia-smi

I get the following "executable file not found in $PATH" error:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH": unknown.
ERRO[0000] error waiting for container: context canceled 

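I am very new to Docker, so it would be great if someone could walk me through a solution. I have searched for answers, but the actual steps for fixing the problem have eluded me. Any help would be appreciated.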

Edit: I set up the GCE instance as described in the tutorial (i.e. Ubuntu 16.04 LTS, 50 GB boot disk, 1 GPU, with jupyter and tensorboard).


1 Answer


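For the first problem, it looks like nvidia-docker-plugin is already running, which is why port 3476 is in use. To find the process holding the port, use: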

sudo netstat -tlpn | grep 3476

and kill it:

sudo pkill nvidia-docker
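
Before killing anything, it may be worth confirming which program actually owns the port. The following is only a sketch: port_owner is a hypothetical helper name (not part of nvidia-docker), and it assumes the usual netstat -tlpn column layout, where the fourth column is the local address and the last column is PID/Program name.

```shell
# Hypothetical helper: print the "PID/Program" column for any listener on
# the given port. Expects `netstat -tlpn` output on stdin.
# Assumes column 4 is the local address (e.g. 127.0.0.1:3476) and the last
# column is PID/Program name; netstat may truncate long program names.
port_owner() {
  awk -v port="$1" '$4 ~ ":" port "$" { print $NF }'
}

# Usage on the host (not run here):
#   sudo netstat -tlpn | port_owner 3476
```

If the program shown is not nvidia-docker-plugin, something else is bound to 3476 and pkill nvidia-docker will not free the port.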

Second, install nvidia-docker2 and reload the Docker daemon configuration:

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
  sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update

# Install nvidia-docker2 and reload the Docker daemon configuration
sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd
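
After the reload, one way to check the result is to see whether Docker now lists an nvidia runtime and then rerun the nvidia-smi test from the question. This is a sketch, not part of the original answer: has_nvidia_runtime is a hypothetical helper name, and it assumes docker info prints a "Runtimes:" line and that nvidia-docker2 registered its runtime under the name nvidia.

```shell
# Hypothetical helper: succeeds if `docker info` output (read from stdin)
# has a "Runtimes:" line that includes "nvidia".
has_nvidia_runtime() {
  grep -i 'runtimes' | grep -qw 'nvidia'
}

# Usage on the host (not run here):
#   sudo docker info | has_nvidia_runtime && echo "nvidia runtime registered"
#   sudo docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
```

If the runtime is not listed, restarting the Docker service instead of sending SIGHUP may be needed; that is a guess on my part and depends on the setup.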

Link to more details:

Answered 2019-05-01T18:11:58.437