
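We are trying to capture TPU profiling data while running our training job on AI Platform, following this tutorial. We take all the required information, such as the TPU name, from our model's output.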

config.yaml:

trainingInput:
  scaleTier: BASIC_TPU
  runtimeVersion: '1.15' # also tried '2.1'

Job submission command:

export DATE=$(date '+%Y%m%d_%H%M%S') && \
gcloud ai-platform jobs submit training "imaterialist_image_classification_model_${DATE}" \
--region=us-central1 \
--staging-bucket="gs://${BUCKET}" \
--module-name='efficientnet.main' \
--config=config.yaml \
--package-path="${PWD}/efficientnet" \
-- \
--data_dir="gs://${BUCKET}/tfrecords/" \
--train_batch_size=8 \
--train_steps=5 \
--model_dir="gs://${BUCKET}/algorithms_training/imaterialist_image_classification_model/${DATE}" \
--model_name='efficientnet-b4' \
--skip_host_call=true \
--gcp_project=${GCP_PROJECT_ID} \
--mode=train

When we try to run capture_tpu_profile with the TPU name that our model obtained from the master:

capture_tpu_profile --gcp_project="${GCP_PROJECT_ID}" --logdir="gs://${BUCKET}/algorithms_training/imaterialist_image_classification_model/20200318_005446" --tpu_zone='us-central1-b' --tpu='<tpu_IP_address>'

we get this error:

  File "/home/kovtuh/.local/lib/python3.7/site-packages/tensorflow_core/python/distribute/cluster_resolver/tpu_cluster_resolver.py", line 480, in _fetch_cloud_tpu_metadata
    "constructor. Exception: %s" % (self._tpu, e))
ValueError: Could not lookup TPU metadata from name 'b'<tpu_IP_address>''. Please doublecheck the tpu argument in the TPUClusterResolver constructor. Exception: <HttpError 404 when requesting https://tpu.googleapis.com/v1/projects/<GCP_PROJECT_ID>/locations/us-central1-b/nodes/<tpu_IP_address>?alt=json returned "Resource 'projects/<GCP_PROJECT_ID>/locations/us-central1-b/nodes/<tpu_IP_address>' was not found". Details: "[{'@type': 'type.googleapis.com/google.rpc.ResourceInfo', 'resourceName': 'projects/<GCP_PROJECT_ID>/locations/us-central1-b/nodes/<tpu_IP_address>'}]">
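To make our reading of this error concrete, here is a minimal sketch of how we understand the lookup. It is not the TPUClusterResolver source: the URL format is copied from the error message above, and the helper names `tpu_node_url` and `looks_like_ip` are hypothetical, introduced only for illustration. It shows that whatever we pass as --tpu ends up as the node name in the request, so an IP address is looked up as if it were a node called by that address in our project.

```python
import ipaddress


def tpu_node_url(project: str, zone: str, tpu: str) -> str:
    """Rebuild the metadata URL seen in the 404 (format taken from the error text)."""
    return (
        "https://tpu.googleapis.com/v1/"
        f"projects/{project}/locations/{zone}/nodes/{tpu}?alt=json"
    )


def looks_like_ip(value: str) -> bool:
    """Return True if value is an IPv4 address, optionally followed by ':port'."""
    host = value.split(":")[0]
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    # Hypothetical values; the real project ID and address are redacted above.
    tpu_arg = "10.0.0.2"
    print(tpu_node_url("my-project", "us-central1-b", tpu_arg))
    if looks_like_ip(tpu_arg):
        print("The --tpu value is an address, not a node name in this project.")
```

With an address as input, the rebuilt URL has the same shape as the one in the 404, which matches our suspicion that no node with that name exists in our project.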

It seems that a TPU provisioned by AI Platform is not attached to our project. Which project is it attached to, and can we access such a TPU to capture its profile?
