tensorflow - How to estimate the training input in Google Cloud ML?

Question

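I am using Google Cloud ML to train a model that detects my custom objects with TensorFlow's Object Detection API. Following this TensorFlow guide and reading this Google Cloud ML documentation, I configured my training input and saved it as cloud.yml: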

trainingInput:
  runtimeVersion: "1.2"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard

I submitted the job with this command:

gcloud ml-engine jobs submit training `whoami`_object_detection_`date +%s` \
    --runtime-version 1.2 \
    --job-dir=gs://MY_BUCKET_NAME/train \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --region us-central1 \
    --config object_detection/samples/cloud/cloud.yml \
    -- \
    --train_dir=gs://MY_BUCKET_NAME/train \
    --pipeline_config_path=gs://MY_BUCKET_NAME/data/ssd_mobilenet_v1_pets.config

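After 10 minutes of preparation and another 20 minutes of running, the job logged a strange message, "Error reported to Coordinator", and then failed with an out-of-memory error: "The replica master 0 ran out-of-memory and exited with a non-zero status of 247".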

I checked the job's resource usage and found that memory utilization on the master and on all 5 workers was very high (about 90-95%), while their CPU utilization was normal (60-65%).

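I suppose the memory I configured for the trainer is insufficient. But how do I estimate which training input parameters to change? That is, how can I know whether to move masterType to a higher tier, raise workerCount, or reduce batch_size in ssd_mobilenet_v1_pets.config? Is the image size relevant in this case?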
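To make the image size and batch_size part of the question concrete, here is a minimal back-of-the-envelope sketch in Python. It only estimates the memory held by decoded image tensors waiting in the input pipeline; it ignores model weights, activations, and framework overhead, so it is a lower bound at best. The function names, the float32 assumption, the queue depth, and the example image sizes and batch size are all hypothetical values chosen for illustration, not values read from the job above.

```python
def estimate_input_memory_bytes(batch_size, height, width, channels=3,
                                bytes_per_value=4, queued_batches=1):
    """Rough size of decoded image batches held in memory.

    Assumes every image is stored as a dense height x width x channels
    array with bytes_per_value bytes per element (4 for float32), and
    that queued_batches batches are buffered at the same time.
    """
    per_image = height * width * channels * bytes_per_value
    return per_image * batch_size * queued_batches


def to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / (1024 ** 2)


if __name__ == "__main__":
    # Hypothetical comparison: small resized images vs. full-resolution photos,
    # with an assumed batch size of 24 and 100 buffered batches.
    for height, width in [(300, 300), (1080, 1920)]:
        total = estimate_input_memory_bytes(
            batch_size=24, height=height, width=width, queued_batches=100)
        print("%dx%d: about %.0f MiB of buffered input" %
              (height, width, to_mib(total)))
```

Under these assumptions the estimate grows linearly with batch_size, with the number of buffered batches, and with the pixel count of each image, which suggests that large source images could matter as much as the batch size itself.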

I am quite confused now. Google Cloud is making me poor so quickly that trial and error does not seem affordable.
