google-cloud-platform - Unable to create an endpoint using a Unified Cloud AI Platform custom container

Question

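Due to certain VPC restrictions, I am forced to use a custom container to serve predictions for a model trained with TensorFlow. As the documentation requires, I created an HTTP server using TensorFlow Serving. The Dockerfile used to build the image is as follows: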

FROM tensorflow/serving:2.4.1-gpu

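# name of the model that TensorFlow Serving loads from /models/${MODEL_NAME}
ENV MODEL_NAME=my_model
# copy the model directory (with its version subfolder) into the image
COPY my_model /models/my_model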

where my_model contains the saved_model inside a folder named 1/.

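I then pushed this image to Google Container Registry and created a Model by using Import an existing custom container and changing the Port to 8501. However, when I try to deploy the model to an endpoint using a single compute node of type n1-standard-16 with 1 P100 GPU, the deployment fails with the following error: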

Failed to create session: Internal: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

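I can't figure out how this is happening. I am able to run the same Docker image on my local machine, and I can successfully get predictions by hitting the endpoint it creates: http://localhost:8501/v1/models/my_model:predict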

Any help in this regard would be appreciated.

score 0 · Accepted Answer

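The issue was resolved by downgrading the TensorFlow Serving image to the 2.3.0-gpu version. Based on the error message, the CUDA version in the custom model image does not match the driver version available on the GCP AI Platform cluster: the CUDA runtime in the 2.4.1-gpu image needs a newer driver than the one installed there.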
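
A minimal sketch of what the downgraded Dockerfile might look like, assuming the rest of the build from the question stays unchanged and only the base image tag differs:

```dockerfile
# Base image downgraded from 2.4.1-gpu to 2.3.0-gpu, per the fix described above
FROM tensorflow/serving:2.3.0-gpu

# Same model setup as in the question's Dockerfile
ENV MODEL_NAME=my_model
COPY my_model /models/my_model
```

After rebuilding and pushing the image under this tag, the model would presumably need to be re-imported and redeployed to the endpoint for the change to take effect.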
