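I am trying to run a custom training job on VertexAI. The goal is to train a model, save it to Cloud Storage, and then upload it to VertexAI as a VertexAI Model object. The job works when I run it from my local workstation, but it fails when I run it from Cloud Scheduler. Details below.
Python code for the job: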
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import os
import pickle
from google.cloud import storage
from google.cloud import aiplatform
print("FITTING THE MODEL")
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
# define the model
model = RandomForestClassifier(n_estimators=100, n_jobs=-1)
# fit the model
model.fit(X, y)
print("SAVING THE MODEL TO CLOUD STORAGE")
if 'AIP_MODEL_DIR' not in os.environ:
    raise KeyError(
        'The `AIP_MODEL_DIR` environment variable has not been ' +
        'set. See https://cloud.google.com/ai-platform-unified/docs/tutorials/image-recognition-custom/training'
    )
artifact_filename = 'model' + '.pkl'
# Save model artifact to local filesystem (doesn't persist)
local_path = artifact_filename
with open(local_path, 'wb') as model_file:
    pickle.dump(model, model_file)
# Upload model artifact to Cloud Storage
model_directory = os.environ['AIP_MODEL_DIR']
storage_path = os.path.join(model_directory, artifact_filename)
blob = storage.blob.Blob.from_string(storage_path, client=storage.Client())
blob.upload_from_filename(local_path)
print("UPLOADING MODEL TO VertexAI")
# Upload the model to vertex ai
project="..."
location="..."
display_name="custom_mdoel"
artifact_uri=model_directory
serving_container_image_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-4:latest"
description="test model"
sync=True
aiplatform.init(project=project, location=location)
model = aiplatform.Model.upload(
    display_name=display_name,
    artifact_uri=artifact_uri,
    serving_container_image_uri=serving_container_image_uri,
    description=description,
    sync=sync,
)
model.wait()
print("DONE")
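Running from my local workstation: I set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to the Compute Engine default service account key that I downloaded to my local workstation. I also set the AIP_MODEL_DIR environment variable to point to a Cloud Storage bucket. After running the script, I can see that the model.pkl file is created in the Cloud Storage bucket and that the Model object is created in VertexAI.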
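Triggering the training job from Cloud Scheduler:
This is what I ultimately want to achieve: running the custom training job periodically from Cloud Scheduler. I have packaged the Python script above into a Docker image and uploaded it to Google Artifact Registry. The Cloud Scheduler job specification is given below; I can provide more details if needed. The service account email in oauth_token belongs to the same account whose key I use to set the GOOGLE_APPLICATION_CREDENTIALS environment variable. When I run it (from my local workstation or directly in a VertexAI notebook), I can see that the Cloud Scheduler job is created and that it keeps triggering the custom job. The custom job is able to train the model and save it to Cloud Storage. However, it fails to upload the model to VertexAI, and I get the error status = StatusCode.PERMISSION_DENIED together with {..."grpc_message":"Request had insufficient authentication scopes.","grpc_status":7}. I can't figure out what the authentication problem is, because I am using the same service account in both cases.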
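Since the error is about scopes and not about the account itself, one thing that could be compared between the two environments is which identity and scopes the code actually receives at runtime. The following is a minimal diagnostic sketch and is not part of the original job. It assumes the training container can reach the Google Cloud metadata server at its usual address; the helper names (`parse_scopes`, `has_cloud_platform_scope`, `fetch_metadata`, `report_identity`) are hypothetical. On a local workstation the key file is used instead, so the metadata server is not involved there.

```python
import urllib.request

# Standard metadata-server location for the default service account.
# Assumption: it is reachable from inside the custom training container.
METADATA_BASE = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default"
)
CLOUD_PLATFORM_SCOPE = "https://www.googleapis.com/auth/cloud-platform"


def parse_scopes(raw):
    """Split the newline-separated scope list returned by the metadata server."""
    return [line.strip() for line in raw.splitlines() if line.strip()]


def has_cloud_platform_scope(scopes):
    """Return True if the broad cloud-platform scope is among the scopes."""
    return CLOUD_PLATFORM_SCOPE in scopes


def fetch_metadata(path, timeout=5):
    """Read one value from the metadata server (only works on Google Cloud)."""
    request = urllib.request.Request(
        f"{METADATA_BASE}/{path}",
        headers={"Metadata-Flavor": "Google"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def report_identity():
    """Print the runtime service account and the scopes attached to it."""
    email = fetch_metadata("email").strip()
    scopes = parse_scopes(fetch_metadata("scopes"))
    print("service account:", email)
    print("scopes:", scopes)
    print("has cloud-platform scope:", has_cloud_platform_scope(scopes))
```

Calling `report_identity()` at the top of the training script would log which account and scopes the job really runs with, which can then be compared against the account named in oauth_token.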
job = {
    "name": f'projects/{project_id}/locations/{location}/jobs/test_job',
    "description": "Test scheduler job",
    "http_target": {
        "uri": f'https://{location}-aiplatform.googleapis.com/v1/projects/{project_id}/locations/{location}/customJobs',
        "http_method": "POST",
        "headers": {
            "User-Agent": "Google-Cloud-Scheduler",
            "Content-Type": "application/json; charset=utf-8"
        },
        "body": "...",  # the custom training job body
        "oauth_token": {
            "service_account_email": "...",
            "scope": "https://www.googleapis.com/auth/cloud-platform"
        }
    },
    "schedule": "* * * * *",
    "time_zone": "Africa/Abidjan"
}
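The request body is elided in the spec above and is not reproduced here. For readers unfamiliar with its general shape, here is a rough sketch of how such a body could be assembled, assuming the standard `customJobs` REST fields (`displayName`, `jobSpec.workerPoolSpecs`, `jobSpec.baseOutputDirectory`). The function name, the default machine type, and every argument value are hypothetical placeholders, not the values used in the actual job. The body is encoded to bytes because the Cloud Scheduler client's `body` field is a bytes field.

```python
import json


def build_custom_job_body(display_name, image_uri, output_uri_prefix,
                          machine_type="n1-standard-4"):
    """Build a JSON-encoded CustomJob request body (illustrative only)."""
    custom_job = {
        "displayName": display_name,
        "jobSpec": {
            "workerPoolSpecs": [
                {
                    "machineSpec": {"machineType": machine_type},
                    "replicaCount": 1,
                    # The Docker image holding the training script.
                    "containerSpec": {"imageUri": image_uri},
                }
            ],
            # Vertex AI derives AIP_MODEL_DIR from this output directory.
            "baseOutputDirectory": {"outputUriPrefix": output_uri_prefix},
        },
    }
    return json.dumps(custom_job).encode("utf-8")


# Placeholder values only; substitute the real image and bucket.
body = build_custom_job_body(
    display_name="test-custom-job",
    image_uri="REGION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG",
    output_uri_prefix="gs://BUCKET/PATH",
)
```

The resulting `body` would take the place of the `"..."` value under `http_target`.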