I am following the PAI job tutorial.

Here is my job config:

{
  "jobName": "yuan_tensorflow-distributed-jobguid",
  "image": "docker.io/openpai/pai.run.tensorflow",
  "dataDir": "hdfs://10.11.3.2:9000/yuan/sample/tensorflow",
  "outputDir": "$PAI_DEFAULT_FS_URI/yuan/tensorflow-distributed-jobguid/output",
  "codeDir": "$PAI_DEFAULT_FS_URI/path/tensorflow-distributed-jobguid/code",
  "virtualCluster": "default",
  "taskRoles": [
    {
      "name": "ps_server",
      "taskNumber": 2,
      "cpuNumber": 2,
      "memoryMB": 8192,
      "gpuNumber": 0,
      "portList": [
        {
          "label": "http",
          "beginAt": 0,
          "portNumber": 1
        },
        {
          "label": "ssh",
          "beginAt": 0,
          "portNumber": 1
        }
      ],
      "command": "pip --quiet install scipy && python code/tf_cnn_benchmarks.py --local_parameter_device=cpu --batch_size=32 --model=resnet20 --variable_update=parameter_server --data_dir=$PAI_DATA_DIR --data_name=cifar10 --train_dir=$PAI_OUTPUT_DIR --ps_hosts=$PAI_TASK_ROLE_ps_server_HOST_LIST --worker_hosts=$PAI_TASK_ROLE_worker_HOST_LIST --job_name=ps --task_index=$PAI_CURRENT_TASK_ROLE_CURRENT_TASK_INDEX"
    },
    {
      "name": "worker",
      "taskNumber": 2,
      "cpuNumber": 2,
      "memoryMB": 16384,
      "gpuNumber": 4,
      "portList": [
        {
          "label": "http",
          "beginAt": 0,
          "portNumber": 1
        },
        {
          "label": "ssh",
          "beginAt": 0,
          "portNumber": 1
        }
      ],
      "command": "pip --quiet install scipy && python code/tf_cnn_benchmarks.py --local_parameter_device=cpu --batch_size=32 --model=resnet20 --variable_update=parameter_server --data_dir=$PAI_DATA_DIR --data_name=cifar10 --train_dir=$PAI_OUTPUT_DIR --ps_hosts=$PAI_TASK_ROLE_ps_server_HOST_LIST --worker_hosts=$PAI_TASK_ROLE_worker_HOST_LIST --job_name=worker --task_index=$PAI_CURRENT_TASK_ROLE_CURRENT_TASK_INDEX"
    }
  ],
  "killAllOnCompletedTaskNumber": 2,
  "retryCount": 0
}

The job was submitted successfully, but it failed about 4 minutes later.

Below is my Application Summary.

Start Time: 6/15/2018, 8:18:01 PM

Finish Time: 6/15/2018, 8:22:31 PM

Exit Diagnostics:

[ExitStatus]: LAUNCHER_EXIT_STATUS_UNDEFINED
[ExitCode]: 177
[ExitDiagnostics]: ExitStatus undefined in Launcher, maybe UserApplication itself failed.
[ExitType]: UNKNOWN
________________________________________________________________________________
[ExitCustomizedDiagnostics]:
[ExitCode]: 1
[ExitDiagnostics]: Exception from container-launch.
Container id: container_1529064439409_0003_01_000005
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
    at org.apache.hadoop.util.Shell.run(Shell.java:456)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)

Shell output: [ERROR] EXIT signal received in yarn container, exiting ...

Container exited with a non-zero exit code 1

________________________________________________________________________________
[ExitCustomizedDiagnostics]:

worker: TASK_COMPLETED: [TaskStatus]: {
  "taskIndex" : 1,
  "taskRoleName" : "worker",
  "taskState" : "TASK_COMPLETED",
  "taskRetryPolicyState" : {
    "retriedCount" : 0,
    "succeededRetriedCount" : 0,
    "transientNormalRetriedCount" : 0,
    "transientConflictRetriedCount" : 0,
    "nonTransientRetriedCount" : 0,
    "unKnownRetriedCount" : 0
  },
  "taskCreatedTimestamp" : 1529065083290,
  "taskCompletedTimestamp" : 1529065346772,
  "taskServiceStatus" : {
    "serviceVersion" : 0
  },
  "containerId" : "container_1529064439409_0003_01_000005",
  "containerHost" : "10.11.1.9",
  "containerIp" : "10.11.1.9",
  "containerPorts" : "http:2938;ssh:2939;",
  "containerGpus" : 15,
  "containerLogHttpAddress" : "http://10.11.1.9:8042/node/containerlogs/container_1529064439409_0003_01_000005/admin/",
  "containerConnectionLostCount" : 0,
  "containerIsDecommissioning" : null,
  "containerLaunchedTimestamp" : 1529065087200,
  "containerCompletedTimestamp" : 1529065346768,
  "containerExitCode" : 1,
  "containerExitDiagnostics" : "Exception from container-launch.\nContainer id: container_1529064439409_0003_01_000005\nExit code: 1\nStack trace: ExitCodeException exitCode=1: \n\tat org.apache.hadoop.util.Shell.runCommand(Shell.java:545)\n\tat org.apache.hadoop.util.Shell.run(Shell.java:456)\n\tat org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)\n\tat org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)\n\tat org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)\n\tat org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n\nShell output: [ERROR] EXIT signal received in yarn container, exiting ...\n\n\nContainer exited with a non-zero exit code 1\n",
  "containerExitType" : "UNKNOWN"
}
[ContainerDiagnostics]: Container completed container_1529064439409_0003_01_000005 on HostName 10.11.1.9.
ContainerLogHttpAddress: http://10.11.1.9:8042/node/containerlogs/container_1529064439409_0003_01_000005/admin/
AppCacheNetworkPath: 10.11.1.9:/var/lib/hadoopdata/nm-local-dir/usercache/admin/appcache/application_1529064439409_00111.
ContainerLogNetworkPath: 1. /var/lib/yarn/userlogs/application_1529064439409_0003/container_1529064439409_0003_01_000005
________________________________________________________________________________
[AMStopReason]: Task worker Completed and KillAllOnAnyCompleted enabled.

I found more details in the container log:

[INFO] hdfs_ssh_folder is hdfs://10.11.3.2:9000/Container/admin/yuan_tensorflow-distributed-2/ssh/application_1529064439409_0450
[INFO] task_role_no is 0
[INFO] PAI_TASK_INDEX is 1
[INFO] waitting for ssh key ready
[INFO] waitting for ssh key ready
[INFO] ssh key pair ready ...
[INFO] begin to download ssh key pair from hdfs ...
[INFO] start ssh service
 * Restarting OpenBSD Secure Shell server sshd [ OK ]
[INFO] USER COMMAND START

Traceback (most recent call last):
  File "code/tf_cnn_benchmarks.py", line 38, in <module>
    import benchmark_storage
ImportError: No module named benchmark_storage
[DEBUG] EXIT signal received in docker container, exiting ...

Conclusion:

The code is incomplete and is missing some of its dependencies. Below is a working job config.

{
  "jobName": "tensorflow-cifar10",
  "image": "openpai/pai.example.tensorflow",

  "dataDir": "/tmp/data",
  "outputDir": "/tmp/output",

  "taskRoles": [
    {
      "name": "cifar_train",
      "taskNumber": 1,
      "cpuNumber": 8,
      "memoryMB": 32768,
      "gpuNumber": 1,
      "command": "git clone https://github.com/tensorflow/models && cd models/research/slim && python download_and_convert_data.py --dataset_name=cifar10 --dataset_dir=$PAI_DATA_DIR && python train_image_classifier.py --batch_size=64 --model_name=inception_v3 --dataset_name=cifar10 --dataset_split_name=train --dataset_dir=$PAI_DATA_DIR --train_dir=$PAI_OUTPUT_DIR"
    }
  ]
}

2 Answers

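In general, you need to look at the logs of all the workers, and especially at the first container that exited, to see what happened there. Any container that exits causes the Launcher to stop the whole job early, which is why the application diagnostics show the "EXIT signal received in yarn container" message.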
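
To make that concrete, here is a minimal Python sketch of how one might pick out the first container to exit. It assumes you have already collected the per-task status objects into a list. The field names (`taskRoleName`, `taskIndex`, `containerCompletedTimestamp`, `containerExitCode`, `containerLogHttpAddress`) are the ones that appear in the [TaskStatus] diagnostics in the question; the function name `first_exited` and the second and third sample entries are made up for illustration.

```python
def first_exited(task_statuses):
    """Return the status of the task whose container completed earliest.

    Tasks that have no completion timestamp yet (still running or killed
    before reporting) are ignored. Returns None if nothing has completed.
    """
    finished = [
        t for t in task_statuses
        if t.get("containerCompletedTimestamp") is not None
    ]
    if not finished:
        return None
    return min(finished, key=lambda t: t["containerCompletedTimestamp"])


statuses = [
    # Taken from the diagnostics in the question.
    {
        "taskRoleName": "worker",
        "taskIndex": 1,
        "containerCompletedTimestamp": 1529065346768,
        "containerExitCode": 1,
        "containerLogHttpAddress": "http://10.11.1.9:8042/node/containerlogs/"
                                   "container_1529064439409_0003_01_000005/admin/",
    },
    # Hypothetical entries, only here so there is something to compare against.
    {
        "taskRoleName": "worker",
        "taskIndex": 0,
        "containerCompletedTimestamp": 1529065350000,
        "containerExitCode": 1,
    },
    {
        "taskRoleName": "ps_server",
        "taskIndex": 0,
        "containerCompletedTimestamp": None,
        "containerExitCode": None,
    },
]

culprit = first_exited(statuses)
if culprit is not None:
    print(
        culprit["taskRoleName"],
        culprit["taskIndex"],
        "exit code:", culprit["containerExitCode"],
        "log:", culprit.get("containerLogHttpAddress", "(no log address)"),
    )
```

The log address of that task is the one to open first; the containers that finished later were most likely just stopped by the Launcher.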

answered 2018-06-19T11:06:02.010

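Logs of failed jobs are not deleted. They are moved to HDFS after the job completes.

From your log, the code seems to be missing some files. Please download the entire benchmarks folder rather than only one or two files (such as the CNN benchmark script).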
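
If it helps, a missing file like this can be caught before submitting the job. The following is a rough sketch, not part of PAI: it parses a script with Python 3's `ast` module, collects its absolute imports, and reports the top-level modules that resolve neither in the script's own directory nor in the local environment. The function name `missing_imports` is made up, and the sketch assumes that the script parses under Python 3 and that your local environment roughly matches the Docker image (otherwise installed packages may be reported too).

```python
import ast
import importlib
import importlib.util
import os
import sys


def _resolvable(name):
    """True if a top-level module name can be found on the current sys.path."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # ValueError is raised for already-loaded modules without a spec;
        # treat those as present rather than missing.
        return name in sys.modules


def missing_imports(script_path):
    """List top-level modules imported by script_path that cannot be found."""
    script_dir = os.path.dirname(os.path.abspath(script_path))
    with open(script_path) as f:
        tree = ast.parse(f.read(), filename=script_path)

    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names.add(node.module.split(".")[0])

    # Mimic "python script.py": the script's directory is searched first.
    sys.path.insert(0, script_dir)
    importlib.invalidate_caches()
    try:
        return sorted(n for n in names if not _resolvable(n))
    finally:
        sys.path.remove(script_dir)


if __name__ == "__main__":
    # Usage: python check_imports.py code/tf_cnn_benchmarks.py
    for path in sys.argv[1:]:
        print(path, "->", missing_imports(path))
```

Run against a code directory that holds only `tf_cnn_benchmarks.py`, a check like this would be expected to list `benchmark_storage`, which is the module named in the ImportError above.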

answered 2018-06-20T08:39:19.510