
I wrote a Python package that trains a neural network. I then package it with the following command.

python3 setup.py sdist --formats=gztar

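When I run this job through the GCP console and click through all the options manually, I get the logs from my program as expected (see the example below).

Example of successful logs: (image)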

However, when I run exactly the same job programmatically, no logs appear. I only see the final error (if one occurs):

Example of missing logs: (image)

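In both cases the program runs; I just can't see any output. What could be the reason? For reference, here is the code I use to start the training process programmatically: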

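import datetime
import shutil
import subprocess
from pathlib import Path

from google.cloud import aiplatform

ENTRY_POINT = "projects.yaw_correction.yaw_correction"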
TIMESTAMP = datetime.datetime.strftime(datetime.datetime.now(),"%y%m%d_%H%M%S")
PROJECT = "yaw_correction"
GCP_PROJECT = "our_gcp_project_name"
LOCATION = "europe-west1"
BUCKET_NAME = "our_bucket_name"
DISPLAY_NAME = "Training_Job_" + TIMESTAMP
CONTAINER_URI = "europe-docker.pkg.dev/vertex-ai/training/pytorch-xla.1-9:latest"
MODEL_NAME = "Model_" + TIMESTAMP
ARGS = ["/gcs/fotokite-training-data/yaw_correction/", "--cloud", "--gpu"]
TENSORBOARD = f"projects/{GCP_PROJECT}/locations/europe-west4/tensorboards/yaw_correction"

MACHINE_TYPE = "n1-standard-4"
REPLICA_COUNT = 1
ACCELERATOR_TYPE = "ACCELERATOR_TYPE_UNSPECIFIED"
ACCELERATOR_COUNT = 0
SYNC = False

#Delete existing source distributions
def deleteDist():
    dirpath = Path('dist')
    if dirpath.exists() and dirpath.is_dir():
        shutil.rmtree(dirpath)

# Copy distribution to the cloud bucket storage
deleteDist()
subprocess.run("python3 setup.py sdist --formats=gztar", shell=True)
filename = list(Path('dist').glob('*'))
if len(filename) != 1:
    raise Exception(f"Expected exactly one distribution in dist/, found {len(filename)}")
print(str(filename[0]))
PACKAGE_URI = f"gs://{BUCKET_NAME}/distributions/"
subprocess.run(f"gsutil cp {str(filename[0])} {PACKAGE_URI}", shell=True)
PACKAGE_URI += str(filename[0].name)
deleteDist()

# Initialise the compute instance
aiplatform.init(project=GCP_PROJECT, location=LOCATION, staging_bucket=BUCKET_NAME)

# Schedule the job
job = aiplatform.CustomPythonPackageTrainingJob(
    display_name=DISPLAY_NAME,
    #script_path="trainer/test.py",
    python_package_gcs_uri=PACKAGE_URI,
    python_module_name=ENTRY_POINT,
    #requirements=['tensorflow_datasets~=4.2.0', 'SQLAlchemy~=1.4.26', 'google-cloud-secret-manager~=2.7.2', 'cloud-sql-python-connector==0.4.2', 'PyMySQL==1.0.2'],
    container_uri=CONTAINER_URI,
)

model = job.run(
    dataset=None,
    #base_output_dir=f"gs://{BUCKET_NAME}/{PROJECT}/Train_{TIMESTAMP}",
    base_output_dir=f"gs://{BUCKET_NAME}/{PROJECT}/",
    service_account="vertex-ai-fotokite-service-acc@fotokite-cv-gcp-exploration.iam.gserviceaccount.com",
    environment_variables=None,
    args=ARGS,
    replica_count=REPLICA_COUNT,
    machine_type=MACHINE_TYPE,
    accelerator_type=ACCELERATOR_TYPE,
    accelerator_count=ACCELERATOR_COUNT,
    #tensorboard=TENSORBOARD,
    sync=SYNC
)
print(model)
print("JOB SUBMITTED")

1 Answer


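An error like "The replica workerpool0-0 exited with a non-zero status of 1" usually means that something went wrong while packaging the Python files, or in the code itself.

You can check the following: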

  • Check that all files are in the package (the training files and the dependencies), as in the following example:
    setup.py
    demo/PKG
    demo/SOURCES.txt
    demo/dependency_links.txt
    demo/requires.txt
    demo/level.txt
    trainer/__init__.py
    trainer/metadata.py
    trainer/model.py
    trainer/task.py
    trainer/utils.py
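
The check above can be scripted rather than done by eye. The following is only a minimal sketch, assuming the archive is a gzipped sdist that wraps everything in a single top-level folder; the helper names `sdist_members` and `missing_files` and the list of expected files are hypothetical and should be adapted to your own package layout.

```python
import sys
import tarfile
from pathlib import PurePosixPath


def sdist_members(archive_path):
    """Return the file paths inside a .tar.gz sdist, relative to its top-level folder."""
    with tarfile.open(archive_path, "r:gz") as tar:
        names = [m.name for m in tar.getmembers() if m.isfile()]
    members = set()
    for name in names:
        parts = PurePosixPath(name).parts
        # Assumption: an sdist wraps its content in one "<name>-<version>/" folder; drop it.
        if len(parts) > 1:
            members.add(str(PurePosixPath(*parts[1:])))
    return members


def missing_files(archive_path, expected):
    """Return the expected relative paths that are not present in the archive."""
    return sorted(set(expected) - sdist_members(archive_path))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical list of files; replace it with the modules your job needs.
    expected = ["setup.py", "trainer/__init__.py", "trainer/task.py"]
    print("Missing from archive:", missing_files(sys.argv[1], expected))
```

Run it against the archive produced by `python3 setup.py sdist --formats=gztar` (the file in `dist/`) before uploading. An empty list means every expected file was packaged.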
Answered 2021-11-12T21:07:34.540