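I am training a model with TensorFlow on Google Cloud's AI Platform. Training itself runs fine, but I cannot save the finished model in SavedModel format to my Cloud Storage bucket. I know the bucket is set up correctly, because I download my training data from that same bucket at the start of training. This is the code I use to save the model: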

SAVE_PATH = os.path.join("gs://", 'machine-learning-ebay', 'job-dir')
linear_model.save(SAVE_PATH)

Here "machine-learning-ebay" is the bucket and "job-dir" is a folder inside that bucket.

I get the following error on the job description page in Google Cloud:

Traceback (most recent call last):
  [...]
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/tracking/util.py", line 1219, in save
    file_prefix_tensor, object_graph_tensor, options)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/tracking/util.py", line 1164, in _save_cached_when_graph_building
    save_op = saver.save(file_prefix, options=options)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/saving/functional_saver.py", line 300, in save
    return save_fn()
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/saving/functional_saver.py", line 287, in save_fn
    sharded_prefixes, file_prefix, delete_old_dirs=True)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 504, in merge_v2_checkpoints
    delete_old_dirs=delete_old_dirs, name=name, ctx=_ctx)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 528, in merge_v2_checkpoints_eager_fallback
    attrs=_attrs, ctx=ctx, name=name)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.NotFoundError: Error executing an HTTP request: HTTP response code 404 with body '{
  "error": {
    "code": 404,
    "message": "No such object: machine-learning-ebay/job-dir/variables/variables_temp/part-00000-of-00001.data-00000-of-00001",
    "errors": [
      {
        "message": "No such object: machine-learning-ebay/job-dir/variables/variables_temp/part-00000-of-00001.data-00000-of-00001",
        "domain": "global",
        "reason": "notFound"
      }
    ]
  }
}

Any help is greatly appreciated; the deadline for this project is today.

1 Answer

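Following the code in Google's training example (https://github.com/GoogleCloudPlatform/cloudml-samples/blob/main/census/tf-keras/trainer/task.py) and a GitHub issue reporting that a timestamped output folder resolves the overwrite problem (https://github.com/kubeflow/pipelines/issues/2171), I changed my export code to the following: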

from datetime import datetime

current_time = datetime.now().strftime("%H.%M.%S")
tf.compat.v1.keras.experimental.export_saved_model(linear_model, 'gs://machine-learning-ebay/job-dir/keras-export' + current_time)

This fixed the error I was getting during training, and the model was exported successfully.
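As a minimal sketch of the timestamped-folder idea, the path construction can be pulled into a small helper. The helper name `timestamped_export_path` and the date-plus-time format are my own assumptions, not part of the Google sample. I include the date because a time-only stamp such as `%H.%M.%S` repeats every day, so two jobs could in principle still write to the same folder.

```python
from datetime import datetime, timezone


def timestamped_export_path(base_dir, prefix="keras-export", now=None):
    """Build a unique export path under base_dir (hypothetical helper).

    base_dir: e.g. "gs://machine-learning-ebay/job-dir"
    prefix:   folder name prefix; "keras-export" mirrors the answer above
    now:      optional datetime, mainly so the result can be made deterministic
    """
    if now is None:
        now = datetime.now(timezone.utc)
    # Date and time together, so the folder name does not repeat daily.
    stamp = now.strftime("%Y%m%d-%H%M%S")
    # Join with "/" explicitly: gs:// URLs always use forward slashes.
    return f"{base_dir.rstrip('/')}/{prefix}-{stamp}"


export_path = timestamped_export_path("gs://machine-learning-ebay/job-dir")
print(export_path)
# The path would then be passed to the export call from the answer, e.g.:
# tf.compat.v1.keras.experimental.export_saved_model(linear_model, export_path)
```

Each run then writes into a fresh folder, so the export never has to overwrite an existing one.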

Answered 2022-01-07T17:47:53.667