
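Only two Mesos frameworks support GPU resources: Marathon and Aurora. I want to launch batch jobs on Mesos agents that have GPU resources, and of the two only Aurora supports that kind of job. However, DC/OS does not officially support Aurora at the moment. I tried to integrate it myself, without success. The DC/OS Mesos masters do not register the Aurora framework, although Exhibitor does create records for Aurora. I could not find any entries about Aurora in the Mesos master logs. Here is my Aurora scheduler configuration: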

 #!/bin/bash

 GLOG_v=0
 LIBPROCESS_PORT=8083
 #LIBPROCESS_IP=127.0.0.1

 JAVA_HOME=/opt/mesosphere/active/java/usr/java

 JAVA_OPTS="-server -Djava.library.path='/opt/mesosphere/lib;/usr/lib;/usr/lib64'"

 PATH=$PATH:/opt/mesosphere/bin

 MESOS_NATIVE_JAVA_LIBRARY=/opt/mesosphere/lib/libmesos.so

 LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/mesosphere/lib

 JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/opt/mesosphere/lib

 # Flags control the behavior of the Aurora scheduler.
 # For a full list of available flags, run /usr/lib/aurora/bin/aurora-scheduler -help
 AURORA_FLAGS=(
    # The name of this cluster.
   -cluster_name='My Cluster'

    # The HTTP port upon which Aurora will listen.
   -http_port=8088

    # The ZooKeeper URL of the ZNode where the Mesos master has registered.
    -mesos_master_address=zk://master_ip1:2181,master_ip2:2181,master_ip3:2181/mesos

    # The ZooKeeper quorum to which Aurora will register itself.
    -zk_endpoints=master_ip1:2181,master_ip1:2181,master_ip1:2181

    # The ZooKeeper ZNode within the specified quorum to which Aurora will register its
    # ServerSet, which keeps track of all live Aurora schedulers.
    -serverset_path='/aurora/scheduler'

    # Allows the scheduling of containers of the provided type.
    -allowed_container_types='DOCKER,MESOS'

    -allow_docker_parameters=true
    -allow_gpu_resource=true
    -executor_user=root
    ### Native Log Settings ###

    # The native log serves as a replicated database which stores the state of the
    # scheduler, allowing for multi-master operation.

    # Size of the quorum of Aurora schedulers which possess a native log.  If running in
    # multi-master mode, consult the following document to determine appropriate values:
    #
    # https://aurora.apache.org/documentation/latest/deploying-aurora-scheduler/#replicated-log-configuration
    -native_log_quorum_size=2
    # The ZooKeeper ZNode to which Aurora will register the locations of its replicated log.
    -native_log_zk_group_path='/aurora/replicated-log'
    # The local directory in which an Aurora scheduler can find Aurora's replicated log.
    -native_log_file_path='/var/lib/aurora/scheduler/db'
    # The local directory in which Aurora schedulers will place state backups.
    -backup_dir='/var/lib/aurora/scheduler/backups'

   ### Thermos Settings ###

   # The local path of the Thermos executor binary.
    -thermos_executor_path='/usr/bin/thermos_executor'
   # Flags to pass to the Thermos executor.
    -thermos_executor_flags='--announcer-ensemble 127.0.0.1:2181')

1 Answer


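I managed to launch the Aurora framework on DC/OS 1.8. Because Mesos and Java are embedded in DC/OS with a custom configuration, custom paths in particular, I had to isolate Aurora with Docker. You can find Docker images of the Aurora components in my Docker repo: Aurora scheduler and Aurora executor. This also makes it possible for me or someone else to create a Universe package.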

Steps to deploy the Aurora scheduler on DC/OS:

  1. Create the /var/lib/aurora folder on every DC/OS agent.

  2. Launch the Aurora executor on all DC/OS agents using the following JSON:

    {
      "id": "/aurora/aurora-executor",
      "env": {
        "MESOS_ROOT": "/var/lib/mesos/slave"
      },
      "instances": 20,
      "cpus": 1,
      "mem": 128,
      "disk": 0,
      "gpus": 0,
      "constraints": [
        [
          "hostname",
          "UNIQUE"
        ]
      ],
      "container": {
        "docker": {
          "image": "krot/aurora-executor",
          "forcePullImage": true,
          "privileged": false,
          "network": "HOST"
        },
        "type": "DOCKER",
        "volumes": [
          {
            "containerPath": "/var/lib/mesos/slave",
            "hostPath": "/var/lib/mesos/slave",
            "mode": "RW"
          },
          {
            "containerPath": "/var/lib/aurora",
            "hostPath": "/var/lib/aurora",
            "mode": "RW"
          }
        ]
      }
    }
    

    Note. Set "instances" to the number of agents.

    2a. An alternative way to deploy the Aurora executor (this has to be done on every DC/OS agent):

     sudo yum install -y python2 wget
     wget -c https://apache.bintray.com/aurora/centos-7/aurora-executor-0.16.0-1.el7.centos.aurora.x86_64.rpm
     sudo rpm -Uhv --nodeps aurora-executor-0.16.0-1.el7.centos.aurora.x86_64.rpm
    

    Then edit the configuration to add the --mesos-root flag, so that the result looks like this:

    grep -A5 OBSERVER_ARGS /etc/sysconfig/thermos
    OBSERVER_ARGS=(
       --port=1338
       --mesos-root=/var/lib/mesos/slave
       --log_to_disk=NONE
       --log_to_stderr=google:INFO
    )
    
  3. Launch the Aurora scheduler using the following JSON (3 or more instances are recommended for fault tolerance):

    {
          "id": "/aurora/aurora-scheduler",
          "env": {
            "CLUSTER_NAME": "YourCluster",
            "ZK_ENDPOINTS": "master.mesos:2181",
            "MESOS_MASTER": "zk://master.mesos:2181/mesos",
            "QUORUM_SIZE": "2",
            "EXTRA_SCHEDULER_ARGS": "-allow_gpu_resource=true"
          },
          "instances": 3,
          "cpus": 1,
          "mem": 1024,
          "disk": 0,
          "gpus": 0,
          "constraints": [
            [
              "hostname",
              "UNIQUE"
            ]
          ],
          "container": {
            "docker": {
              "image": "krot/aurora-scheduler",
              "forcePullImage": true,
              "privileged": false,
              "network": "HOST"
            },
            "type": "DOCKER",
            "volumes": [
              {
                "containerPath": "/var/lib/aurora",
                "hostPath": "/var/lib/aurora",
                "mode": "RW"
              }
            ]
          }
    }
    

    Note. -allow_gpu_resource=true enables GPU support. The Aurora scheduler can be configured using environment variables. See the documentation for details.
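
Since "instances" in the executor definition has to match the number of agents, it can be convenient to generate the JSON rather than edit it by hand. The following is only a minimal sketch: the function name executor_app and the agent_count parameter are made up for illustration, while the field values are copied from the executor JSON in step 2.

```python
import json


def executor_app(agent_count):
    """Build the Marathon app definition for the Aurora executor.

    agent_count is a hypothetical parameter: the number of DC/OS agents,
    used as the "instances" value as described in the note for step 2.
    """
    if agent_count < 1:
        raise ValueError("agent_count must be at least 1")
    # Host directories mounted into the container at the same path.
    shared_dirs = ["/var/lib/mesos/slave", "/var/lib/aurora"]
    return {
        "id": "/aurora/aurora-executor",
        "env": {"MESOS_ROOT": "/var/lib/mesos/slave"},
        "instances": agent_count,
        "cpus": 1,
        "mem": 128,
        "disk": 0,
        "gpus": 0,
        # One executor per host, as in the original definition.
        "constraints": [["hostname", "UNIQUE"]],
        "container": {
            "type": "DOCKER",
            "docker": {
                "image": "krot/aurora-executor",
                "forcePullImage": True,
                "privileged": False,
                "network": "HOST",
            },
            "volumes": [
                {"containerPath": path, "hostPath": path, "mode": "RW"}
                for path in shared_dirs
            ],
        },
    }


if __name__ == "__main__":
    # Print the definition for a 20-agent cluster (the value used in step 2).
    print(json.dumps(executor_app(agent_count=20), indent=2))
```

The printed JSON can be saved to a file and submitted the same way as the hand-written one, for example with dcos marathon app add or through the Marathon UI.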

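To confirm that the scheduler actually registered with the Mesos master (the problem described in the question), one option is to look at the master's frameworks endpoint. This is a rough sketch under a few assumptions that are not from the answer: that the leading master is reachable at http://leader.mesos:5050, that the scheduler registers under the framework name "Aurora", and the helper names themselves, which are hypothetical.

```python
import json
from urllib.request import urlopen


def registered_framework_names(state):
    """Return the names of the frameworks listed in a master response."""
    return [fw.get("name") for fw in state.get("frameworks", [])]


def aurora_registered(state, framework_name="Aurora"):
    """Check whether a framework with the given name is listed.

    "Aurora" is assumed to be the name the scheduler registers under;
    adjust it if the scheduler was started with a different name.
    """
    return framework_name in registered_framework_names(state)


def fetch_frameworks(master_url="http://leader.mesos:5050"):
    """Fetch the frameworks JSON from the Mesos master.

    The master URL is an assumption about the cluster's DNS setup.
    """
    with urlopen(master_url + "/master/frameworks", timeout=10) as resp:
        return json.load(resp)
```

From a node inside the cluster, aurora_registered(fetch_frameworks()) then tells you whether the framework shows up. If it does not, the scheduler's own logs are the next place to look.
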
answered 2016-11-17T11:40:11.933