I created a single-node n1-standard-16 Dataproc cluster (base image: 1.5.35-debian10) with a Tesla T4 GPU attached, following the tutorial at https://cloud.google.com/dataproc/docs/concepts/compute/gpus. I installed the NVIDIA drivers after creating the cluster, and I was able to successfully run Spark jobs on the GPU.
However, when I stop the master instance, start it again, and submit a new Dataproc job, the job fails after 5 minutes with "Task was not acquired", and I cannot find any way to run jobs on the same cluster.
Any help is appreciated.
Edit: After investigating the hadoop-yarn logs in the /var/log/hadoop-yarn folder as @Dagang suggested, the problem appears to be related to the YARN NodeManager. The NodeManager fails with the message below.
Edit 2: The main failure is "Unexpected: Cannot create yarn cgroup Subsystem:devices Mount points:/proc/mounts User:yarn Path:/sys/fs/cgroup/devices/yarn". Following the advice on the Hadoop website, the following lines need to be run in the installation script:
chown :yarn -R /sys/fs/cgroup/cpu,cpuacct
chmod g+rwx -R /sys/fs/cgroup/cpu,cpuacct
chown :yarn -R /sys/fs/cgroup/devices
chmod g+rwx -R /sys/fs/cgroup/devices
But these commands are already present in install_gpu_drivers.sh.
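One possible explanation (an assumption on my part, not confirmed): initialization actions only run when the cluster is first created, and /sys/fs/cgroup is rebuilt by the kernel on every boot, so the ownership and mode set by the script would be lost after stopping and starting the VM. A sketch of the same setup as a reusable function, with the cgroup root and group passed in as parameters for illustration:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Reapply the group ownership and permissions that YARN needs on the
# cgroup controllers. The cgroup root and group are parameters so the
# logic can be exercised against a scratch directory; on a real node
# the call would be: setup_yarn_cgroups /sys/fs/cgroup yarn
setup_yarn_cgroups() {
  local root="$1" group="$2"
  local subsys
  for subsys in 'cpu,cpuacct' 'devices'; do
    mkdir -p "${root}/${subsys}"
    chgrp -R "${group}" "${root}/${subsys}"
    chmod -R g+rwx "${root}/${subsys}"
  done
}

# If the theory is right, this would have to run on every boot (e.g.
# from a systemd oneshot unit or a crontab @reboot entry), not only
# from the one-time initialization action.
```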
Full error log:
Specified path is a directory, use nvidia-smi under the directory, updated path-to-executable
:/usr/bin/nvidia-smi
Trying to discover GPU information ...
=== Gpus in the system ===
Driver Version:460.73.01
ProductName=Tesla T4, MinorNumber=0, TotalMemory=15109MiB, Utilization=0.0%
CGroup controller already mounted at: /sys/fs/cgroup/devices
Initializing mounted controller devices at /sys/fs/cgroup/devices/yarn
Yarn control group does not exist. Creating /sys/fs/cgroup/devices/yarn
Failed to bootstrap configured resource subsystems!
Unexpected: Cannot create yarn cgroup Subsystem:devices Mount points:/proc/mounts User:yarn Path:/sys/fs/cgroup/devices/yarn
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializePreMountedCGroupController(CGroupsHandlerImpl.java:424)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializeCGroupController(CGroupsHandlerImpl.java:376)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.bootstrap(GpuResourceHandlerImpl.java:86)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.bootstrap(ResourceHandlerChain.java:58)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:316)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:391)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:878)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:946)
2021-09-17 07:18:56,215 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state INITED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize container executor
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize container executor
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:393)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:878)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:946)
Caused by: java.io.IOException: Failed to bootstrap configured resource subsystems!
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:320)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:391)
... 3 more
2021-09-17 07:18:56,216 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize container executor
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:393)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:878)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:946)
Caused by: java.io.IOException: Failed to bootstrap configured resource subsystems!
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:320)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:391)
... 3 more
2021-09-17 07:18:56,219 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: SHUTDOWN_MSG:
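For anyone hitting the same symptom: the cgroup bootstrap failure above can be located without reading the whole file by grepping the NodeManager log. A small sketch (the log path pattern shown in the comment is the Dataproc default and may differ on other setups):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print lines that indicate the cgroup bootstrap failure in a given
# NodeManager log file. Exits 0 whether or not a match is found, so it
# is safe to call under `set -e`.
find_cgroup_failure() {
  grep -E 'Cannot create yarn cgroup|Failed to bootstrap' "$1" || true
}

# On a Dataproc master this would typically be:
#   find_cgroup_failure /var/log/hadoop-yarn/yarn-yarn-nodemanager-*.log
```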
How to reproduce: I use the Google Dataproc Node.js client library to create the Dataproc cluster. This is my configuration:
request: ClusterCreateInput = {
  projectId: 'my-project-id',
  region: 'europe-west-1',
  cluster: {
    projectId: 'my-project-id',
    clusterName: 'test-cluster',
    config: {
      configBucket: 'my-dataproc-log-bucket',
      gceClusterConfig: {
        zoneUri: 'europe-west-1-d',
        serviceAccountScopes: [
          'https://www.googleapis.com/auth/cloud-platform',
          'https://www.googleapis.com/auth/cloud.useraccounts.readonly',
          'https://www.googleapis.com/auth/devstorage.read_write',
          'https://www.googleapis.com/auth/logging.write',
        ],
      },
      masterConfig: {
        numInstances: 1,
        machineTypeUri: 'n1-standard-16',
        diskConfig: {
          bootDiskSizeGb: 100,
          bootDiskType: 'pd-standard',
          numLocalSsds: 0,
        },
        accelerators: [
          {
            acceleratorTypeUri: 'nvidia-tesla-t4',
            acceleratorCount: 1,
          },
        ],
        imageUri: '1.5.35-debian10',
      },
      softwareConfig: {
        properties: {
          'dataproc:dataproc.allow.zero.workers': true,
          'spark:spark.executor.instances': '1',
          'spark:spark.executor.cores': '16',
          'spark:spark.default.parallelism': '16',
          'spark:spark.executor.memory': '38000m',
          'spark-env:ARROW_PRE_0_15_IPC_FORMAT': '1',
          'spark:spark.executorEnv.LD_PRELOAD': 'libnvblas.so',
        },
      },
      initializationActions: [
        {
          executableFile: 'gs://my-bucket-name/install_gpu_driver.sh',
        },
      ],
    },
  },
}
const [clusterOperation] = await this.client.createCluster(request)
const [result] = await clusterOperation.promise()
After the cluster is created, submit a basic job. Once the job completes, stop and start the VM, then resubmit a similar job. The cluster never picks up this new job.