I have a container that loads a PyTorch model. Every time I try to start it, I get this error:
Traceback (most recent call last):
  File "server/start.py", line 166, in <module>
    start()
  File "server/start.py", line 94, in start
    app.register_blueprint(create_api(), url_prefix="/api/1")
  File "/usr/local/src/skiff/app/server/server/api.py", line 30, in create_api
    atomic_demo_model = DemoModel(model_filepath, comet_dir)
  File "/usr/local/src/comet/comet/comet/interactive/atomic_demo.py", line 69, in __init__
    model = interactive.make_model(opt, n_vocab, n_ctx, state_dict)
  File "/usr/local/src/comet/comet/comet/interactive/functions.py", line 98, in make_model
    model.to(cfg.device)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
    param.data = fn(param.data)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 379, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
  File "/usr/local/lib/python3.7/site-packages/torch/cuda/__init__.py", line 161, in _lazy_init
    _check_driver()
  File "/usr/local/lib/python3.7/site-packages/torch/cuda/__init__.py", line 82, in _check_driver
    http://www.nvidia.com/Download/index.aspx""")
AssertionError:
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx
I know that nvidia-docker2 itself is working:
$ docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
Tue Jul 16 22:09:40 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39 Driver Version: 418.39 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:1A:00.0 Off | N/A |
| 0% 44C P0 72W / 260W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:1B:00.0 Off | N/A |
| 0% 44C P0 66W / 260W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 208... Off | 00000000:1E:00.0 Off | N/A |
| 0% 44C P0 48W / 260W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 208... Off | 00000000:3E:00.0 Off | N/A |
| 0% 41C P0 54W / 260W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce RTX 208... Off | 00000000:3F:00.0 Off | N/A |
| 0% 42C P0 48W / 260W | 0MiB / 10989MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce RTX 208... Off | 00000000:41:00.0 Off | N/A |
| 0% 42C P0 1W / 260W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
However, I keep getting the error above.
I have tried the following:
Setting
"default-runtime": nvidia
in /etc/docker/daemon.json
Using
docker run --runtime=nvidia <IMAGE_ID>
Adding the following variables to my Dockerfile:
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
LABEL com.nvidia.volumes.needed="nvidia_driver"
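To rule out a typo in the first item, here is a rough sketch of a check for the contents of /etc/docker/daemon.json. It assumes the layout that the nvidia-docker2 package usually writes (a quoted "nvidia" value plus a matching entry under "runtimes"); the helper name check_daemon_config and the EXAMPLE text are hypothetical, and the runtime path may differ on a given host.

```python
import json


def check_daemon_config(text):
    """Return a list of problems found in a daemon.json document (hypothetical helper)."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        # An unquoted value such as  "default-runtime": nvidia  ends up here.
        return [f"not valid JSON: {exc}"]
    if not isinstance(config, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    if config.get("default-runtime") != "nvidia":
        problems.append('"default-runtime" is not set to "nvidia"')
    if "nvidia" not in config.get("runtimes", {}):
        problems.append('no "nvidia" entry under "runtimes"')
    return problems


# Assumed example of a working configuration; the path is a guess for this host.
EXAMPLE = """
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {"path": "nvidia-container-runtime", "runtimeArgs": []}
  }
}
"""

if __name__ == "__main__":
    print(check_daemon_config(EXAMPLE))
```

On the host, the same function could be given the text of the real file. The Docker daemon also has to be restarted before a change to daemon.json takes effect.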
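I expected this container to run, since we have a production version that does not have these problems. I also know that Docker can find the driver, as the output above shows. Any ideas?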
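One way to narrow this down might be to run a small diagnostic inside the failing image itself rather than in nvidia/cuda:9.0-base. The sketch below is only an assumption about what is worth checking: the two NVIDIA_* environment variables from the Dockerfile, the /dev/nvidia* device nodes that the NVIDIA runtime typically exposes, and what torch.cuda.is_available() reports. The function name gpu_visibility_report is hypothetical.

```python
import glob
import os


def gpu_visibility_report(env=None, dev_dir="/dev"):
    """Collect a few hints about whether this process can see an NVIDIA GPU."""
    env = os.environ if env is None else env
    nodes = glob.glob(os.path.join(dev_dir, "nvidia*"))
    report = {
        "NVIDIA_VISIBLE_DEVICES": env.get("NVIDIA_VISIBLE_DEVICES"),
        "NVIDIA_DRIVER_CAPABILITIES": env.get("NVIDIA_DRIVER_CAPABILITIES"),
        # Typically nvidia0, nvidiactl, ... when the NVIDIA runtime is in effect.
        "device_nodes": sorted(os.path.basename(p) for p in nodes),
    }
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment; leave the answer unknown.
        report["torch_cuda_available"] = None
    else:
        report["torch_cuda_available"] = torch.cuda.is_available()
        report["torch_cuda_version"] = torch.version.cuda
    return report


if __name__ == "__main__":
    for key, value in gpu_visibility_report().items():
        print(f"{key}: {value}")
```

Saved under a hypothetical name such as check_gpu.py, it could be run with docker run --runtime=nvidia --rm <IMAGE_ID> python check_gpu.py. If the device list comes back empty, the container is probably not being started with the NVIDIA runtime; if the devices are present but torch still reports no CUDA, the problem is more likely inside the image.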