I have been using a Compute Engine VM with a T4 GPU on COS for a while, and it worked fine until recently, when cos-extensions install gpu stopped working the way it used to:
I0830 07:32:58.419130 987 main.go:21] Checking if this is the only cos_gpu_installer that is running.
I0830 07:32:58.427417 987 install.go:74] Running on COS build id 16108.470.16
I0830 07:32:58.427566 987 installer.go:187] Getting the default GPU driver version
I0830 07:32:58.427911 987 utils.go:72] Downloading gpu_default_version from https://storage.googleapis.com/cos-tools/16108.470.16/gpu_default_version
I0830 07:32:58.548403 987 utils.go:120] Successfully downloaded gpu_default_version from https://storage.googleapis.com/cos-tools/16108.470.16/gpu_default_version
I0830 07:32:58.548594 987 install.go:85] Installing GPU driver version 450.119.04
I0830 07:32:58.549646 987 cache.go:72] map[BUILD_ID:16108.470.11 DRIVER_VERSION:450.119.04]
I0830 07:32:58.549674 987 install.go:120] Did not find cached version, installing the drivers...
I0830 07:32:58.549681 987 installer.go:82] Configuring driver installation directories
I0830 07:32:58.563327 987 installer.go:196] Updating container's ld cache
I0830 07:32:58.793692 987 signature.go:30] Downloading driver signature for version 450.119.04
I0830 07:32:58.793721 987 utils.go:72] Downloading 450.119.04.signature.tar.gz from https://storage.googleapis.com/cos-tools/16108.470.16/extensions/gpu/450.119.04.signature.tar.gz
E0830 07:32:58.828902 987 artifacts.go:106] Failed to download extensions/gpu/450.119.04.signature.tar.gz from public GCS: failed to download 450.119.04.signature.tar.gz, status: 404 Not Found
E0830 07:32:58.829401 987 install.go:175] failed to download driver signature: failed to download driver signature for version 450.119.04: failed to download extensions/gpu/450.119.04.signature.tar.gz
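The 404 above can be checked independently of the installer. The following is a minimal sketch, assuming curl is available on the machine running it; the URL layout is copied from the log lines above, and the helper names signature_url and signature_status are hypothetical, not part of cos-extensions.

```shell
#!/bin/sh
# Build the signature URL for a COS build id ($1) and driver version ($2).
# The path layout mirrors the URL printed in the installer log above.
signature_url() {
  printf '%s\n' "https://storage.googleapis.com/cos-tools/$1/extensions/gpu/$2.signature.tar.gz"
}

# Print only the HTTP status code returned for that URL (requires network).
# -s silences progress output, -o /dev/null discards the body,
# -w '%{http_code}' prints the response code.
signature_status() {
  curl -s -o /dev/null -w '%{http_code}\n' "$(signature_url "$1" "$2")"
}
```

Running signature_status 16108.470.16 450.119.04 would be expected to print 404 as long as the artifact is missing from the bucket, matching the installer error.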
The installer apparently cannot find the driver signature. I looked into this and followed a workaround:
/usr/bin/docker run --rm \
--privileged \
--net=host \
--pid=host \
--volume /dev:/dev \
--volume /:/root \
--volume /var/lib/toolbox/nvidia:/usr/local/nvidia \
--env NVIDIA_DRIVER_VERSION=450.119.04 \
gcr.io/cos-cloud/cos-gpu-installer:latest
but got this:
+ COS_KERNEL_INFO_FILENAME=kernel_info
+ COS_KERNEL_SRC_HEADER=kernel-headers.tgz
+ TOOLCHAIN_URL_FILENAME=toolchain_url
+ TOOLCHAIN_ENV_FILENAME=toolchain_env
+ TOOLCHAIN_PKG_DIR=/build/cos-tools
+ CHROMIUMOS_SDK_GCS=https://storage.googleapis.com/chromiumos-sdk
+ ROOT_OS_RELEASE=/root/etc/os-release
+ KERNEL_SRC_HEADER=/build/usr/src/linux
+ NVIDIA_DRIVER_VERSION=450.119.04
+ NVIDIA_DRIVER_MD5SUM=
+ NVIDIA_INSTALL_DIR_HOST=/var/lib/nvidia
+ NVIDIA_INSTALL_DIR_CONTAINER=/usr/local/nvidia
+ ROOT_MOUNT_DIR=/root
+ CACHE_FILE=/usr/local/nvidia/.cache
+ LOCK_FILE=/root/tmp/cos_gpu_installer_lock
+ LOCK_FILE_FD=20
+ set +x
[INFO 2021-08-30 07:36:38 UTC] PRELOAD: false
[INFO 2021-08-30 07:36:38 UTC] Running on COS build id 16108.470.16
[INFO 2021-08-30 07:36:38 UTC] Data dependencies (e.g. kernel source) will be fetched from https://storage.googleapis.com/cos-tools/16108.470.16
[INFO 2021-08-30 07:36:38 UTC] Checking if this is the only cos-gpu-installer that is running.
[INFO 2021-08-30 07:36:38 UTC] Checking if third party kernel modules can be installed
/tmp/esp /
/
[INFO 2021-08-30 07:36:38 UTC] Checking cached version
/entrypoint.sh: line 172: CACHE_BUILD_ID: unbound variable
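The last line is the message a shell prints when a script running under set -u expands a variable that was never set. The sketch below reproduces that failure mode in isolation and shows the usual tolerant form with a default value. The cache contents and the function names are hypothetical; whether the real cache file simply lacks CACHE_BUILD_ID is an assumption, not something the logs confirm.

```shell
#!/bin/sh
# Hypothetical cache contents that do not define CACHE_BUILD_ID.
# The KEY=value format is an assumption made for this illustration.
demo_cache='DRIVER_VERSION=450.119.04'

# Strict lookup: under set -u, expanding the unset variable aborts the
# subshell with a non-zero status, as entrypoint.sh does at line 172.
strict_lookup() {
  ( unset CACHE_BUILD_ID; set -u; eval "$1"; echo "${CACHE_BUILD_ID}" )
}

# Tolerant lookup: ${VAR:-default} substitutes a default instead of failing.
tolerant_lookup() {
  ( unset CACHE_BUILD_ID; set -u; eval "$1"; echo "${CACHE_BUILD_ID:-none}" )
}
```

Here strict_lookup "$demo_cache" fails, while tolerant_lookup "$demo_cache" prints none, which a script could treat as a cache miss.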
It looks like something has changed in COS and (perhaps?) the COS GPU drivers. Is there any way to work around this, other than waiting for GCP to fix it?