google-cloud-platform - How to run multiple GPU-accelerated training jobs concurrently on AI Platform

Question

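I am running a TensorFlow training job on AI Platform with the "scaleTier": "BASIC_GPU" setting. My understanding is that this setting uses a single Tesla K80 GPU for the job.

Creating a new job while another job is already running seems to put the new job in a queue until the running job finishes. When I check the logs of the new job, I see the following message: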

This job is number 1 in the queue and requires 8.000000 CPUs and 1 K80 accelerators. The project is using 8.000000 CPUs out of 450 allowed and 1 K80 accelerators out of 0 TPU_V2_POD, 0 TPU_V3_POD, 1 K80, 1 P100, 1 V100, 4 P4, 4 T4, 8 TPU_V2, 8 TPU_V3 allowed across all regions.The project is using 8.000000 CPUs out of 20 allowed and 1 K80 accelerators out of 0 TPU_V2_POD, 0 TPU_V3_POD, 1 K80, 1 P100, 1 P4, 1 T4, 1 V100, 8 TPU_V2, 8 TPU_V3 allowed in the region us-central1.

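This AI Platform documentation seems to say that my project should be able to use up to 30 K80 GPUs concurrently.

Why can't I use even 2 at the same time?

Is there something I need to do to raise my limit to the expected 30?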

score 1 · Accepted Answer

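It looks like your project admin has set a quota on the number of GPUs you can use (note that the message shows your quota in us-central1 is 20 CPUs, 1 K80, 1 P100), so the job is waiting for the K80 to become available.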
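To make the reading of that queue message concrete, here is a minimal Python sketch that pulls the per-region accelerator allowances out of the regional sentence of the log. The function name regional_allowance and the regular expressions are illustrative assumptions based only on the one message quoted in the question; the message format is not a documented interface and may differ in other cases.

```python
import re

# Regional sentence copied from the queue message quoted in the question.
MESSAGE = (
    "The project is using 8.000000 CPUs out of 20 allowed and 1 K80 "
    "accelerators out of 0 TPU_V2_POD, 0 TPU_V3_POD, 1 K80, 1 P100, 1 P4, "
    "1 T4, 1 V100, 8 TPU_V2, 8 TPU_V3 allowed in the region us-central1."
)


def regional_allowance(message):
    """Return {accelerator_name: allowed_count} for the per-region quota list.

    Hypothetical helper: the pattern is inferred from a single log line.
    """
    match = re.search(
        r"accelerators out of ([^.]*?) allowed in the region", message
    )
    if match is None:
        raise ValueError("no per-region accelerator list found")
    # Each entry looks like "<count> <ACCELERATOR_NAME>".
    pairs = re.findall(r"(\d+) ([A-Z0-9_]+)", match.group(1))
    return {name: int(count) for count, name in pairs}


if __name__ == "__main__":
    allowed = regional_allowance(MESSAGE)
    print("K80 allowed:", allowed["K80"])
    print("P100 allowed:", allowed["P100"])
```

With one K80 allowed in the region and one already in use, a second BASIC_GPU job has to wait in the queue.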

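Two options:

(1) Go to console.cloud.google.com/iam-admin/quotas, find the Compute Engine API and K80s, and choose "Edit Quota", or ask your admin to increase it if necessary. Make sure to edit both the all-regions quota and the us-central1 quota. Otherwise, if the admin has granted 1 GPU per region, run the job in another region such as us-west1.

(2) It seems you have a P100 available, so use a custom scale tier and specify a P100.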
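As a rough sketch of option (2), the Python snippet below builds the trainingInput portion of a job request with a custom scale tier and a single P100. The helper name build_training_input and the n1-standard-8 machine type are assumptions chosen for illustration, and job-specific fields (package URIs, Python module, runtime version, and so on) are deliberately left out. Passing a different region value would also cover the "run it in another region" suggestion from option (1), provided that region offers the accelerator.

```python
import json


def build_training_input(region="us-central1"):
    """Return a partial trainingInput dict requesting one P100 on the master.

    Hypothetical helper: the machine type is an assumed example, and the
    job-specific fields are omitted.
    """
    return {
        "scaleTier": "CUSTOM",
        "masterType": "n1-standard-8",  # assumed machine type
        "masterConfig": {
            "acceleratorConfig": {
                "count": 1,
                "type": "NVIDIA_TESLA_P100",
            },
        },
        "region": region,
    }


if __name__ == "__main__":
    # Print the fragment so it can be merged into a full job request.
    print(json.dumps({"trainingInput": build_training_input()}, indent=2))
```

Because this job counts against the P100 quota rather than the K80 quota, it should be able to run alongside an existing BASIC_GPU job, as long as the CPU quota in the region also has room.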

score 1 · Accepted Answer


For new projects, the default quota is very low. You can request a quota increase through this form.

Answered 2020-07-31T16:47:38.017
