I am trying to run a custom training job on Vertex AI. The goal is to train a model, save the model to Cloud Storage, and then upload it to Vertex AI as a Vertex AI Model object. The job runs when I launch it from my local workstation, but it fails when I launch it from Cloud Scheduler. Details below.

The Python code for the job:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import os
import pickle
from google.cloud import storage
from google.cloud import aiplatform

print("FITTING THE MODEL")

# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
# define the model
model = RandomForestClassifier(n_estimators=100, n_jobs=-1)
# fit the model
model.fit(X, y)


print("SAVING THE MODEL TO CLOUD STORAGE")
if 'AIP_MODEL_DIR' not in os.environ:
    raise KeyError(
        'The `AIP_MODEL_DIR` environment variable has not been ' +
        'set. See https://cloud.google.com/ai-platform-unified/docs/tutorials/image-recognition-custom/training'
    )

artifact_filename = 'model' + '.pkl'
# Save model artifact to local filesystem (doesn't persist)
local_path = artifact_filename
with open(local_path, 'wb') as model_file:
    pickle.dump(model, model_file)

# Upload model artifact to Cloud Storage
model_directory = os.environ['AIP_MODEL_DIR']
storage_path = os.path.join(model_directory, artifact_filename)
blob = storage.blob.Blob.from_string(storage_path, client=storage.Client())
blob.upload_from_filename(local_path)


print("UPLOADING MODEL TO VertexAI")

# Upload the model to vertex ai
project="..."
location="..."
display_name="custom_model"
artifact_uri=model_directory
serving_container_image_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-4:latest"
description="test model"
sync=True

aiplatform.init(project=project, location=location)
model = aiplatform.Model.upload(
    display_name=display_name,
    artifact_uri=artifact_uri,
    serving_container_image_uri=serving_container_image_uri,
    description=description,
    sync=sync,
)
model.wait()

print("DONE")

Running from the local workstation: I set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to the key for the Compute Engine default service account that I downloaded to my local workstation. I also set the AIP_MODEL_DIR environment variable to point to a Cloud Storage bucket. After running the script, I can see the model.pkl file created in the Cloud Storage bucket and the Model object created in Vertex AI.
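
For reference, this is roughly how the environment is prepared before the script runs locally (a minimal sketch; the key path and bucket path are placeholders, not my real values):

import os

# Placeholders -- substitute the downloaded service account key and your own bucket
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/compute-engine-default-sa-key.json'
os.environ['AIP_MODEL_DIR'] = 'gs://my-bucket/model-output/'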

Triggering the training job from Cloud Scheduler: This is what I ultimately want to achieve: running the custom training job on a schedule from Cloud Scheduler. I have packaged the Python script above into a Docker image and pushed it to Google Artifact Registry. The Cloud Scheduler job spec is below; I can provide more details if needed. The service account email in oauth_token belongs to the same service account whose key I used to set the GOOGLE_APPLICATION_CREDENTIALS environment variable. When I run it (from my local workstation or directly in a Vertex AI notebook), I can see that the Cloud Scheduler job is created and keeps triggering the custom job. The custom job is able to train the model and save it to Cloud Storage. However, it fails to upload the model to Vertex AI, and I get an error with status = StatusCode.PERMISSION_DENIED and {..."grpc_message":"Request had insufficient authentication scopes.","grpc_status":7}. I can't figure out what the authentication problem is, since I'm using the same service account in both cases.

job = {
  "name": f'projects/{project_id}/locations/{location}/jobs/test_job',
  "description": "Test scheduler job",
  "http_target": {
    "uri": f'https://{location}-aiplatform.googleapis.com/v1/projects/{project_id}/locations/{location}/customJobs',
    "http_method": "POST",
    "headers": {
      "User-Agent": "Google-Cloud-Scheduler",
      "Content-Type": "application/json; charset=utf-8"
    },
    "body": "...",  # the custom training job body
    "oauth_token": {
      "service_account_email": "...",
      "scope": "https://www.googleapis.com/auth/cloud-platform"
    }
  },
  "schedule": "* * * * *",
  "time_zone": "Africa/Abidjan"
}
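
For completeness, this is roughly how a job like the one above can be created with the Python client (a minimal sketch, assuming the google-cloud-scheduler scheduler_v1 client; the project, location, service account email, and request body below are placeholders):

from google.cloud import scheduler_v1

# Placeholder values -- replace with your own project, region, and service account
project_id = "my-project"
location = "us-central1"
service_account_email = "my-sa@my-project.iam.gserviceaccount.com"
custom_job_body = b"..."  # the CustomJob request payload (elided above), as JSON bytes

client = scheduler_v1.CloudSchedulerClient()
parent = f"projects/{project_id}/locations/{location}"

# Build the same job spec as above using typed objects
job = scheduler_v1.Job(
    name=f"{parent}/jobs/test_job",
    description="Test scheduler job",
    http_target=scheduler_v1.HttpTarget(
        uri=f"https://{location}-aiplatform.googleapis.com/v1/projects/{project_id}/locations/{location}/customJobs",
        http_method=scheduler_v1.HttpMethod.POST,
        headers={
            "User-Agent": "Google-Cloud-Scheduler",
            "Content-Type": "application/json; charset=utf-8",
        },
        body=custom_job_body,
        oauth_token=scheduler_v1.OAuthToken(
            service_account_email=service_account_email,
            scope="https://www.googleapis.com/auth/cloud-platform",
        ),
    ),
    schedule="* * * * *",
    time_zone="Africa/Abidjan",
)

response = client.create_job(parent=parent, job=job)
print(response.name)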