apache-spark - Error when submitting a Spark job with spark-jobserver

Question

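I occasionally get the following error when submitting a job. The error goes away if I delete the rootdir of filedao, datadao and sqldao, but that means I have to restart the job server and re-upload my jar.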

{
  "status": "ERROR",
  "result": {
    "message": "Ask timed out on [Actor[akka://JobServer/user/context-supervisor/1995aeba-com.spmsoftware.distributed.job.TestJob#-1370794810]] after [10000 ms]. Sender[null] sent message of type \"spark.jobserver.JobManagerActor$StartJob\".",
    "errorClass": "akka.pattern.AskTimeoutException",
    "stack": ["akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)", "akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)", "scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)", "scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)", "scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)", "akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:331)", "akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:282)", "akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:286)", "akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:238)", "java.lang.Thread.run(Thread.java:745)"]
  }
}

My configuration file is as follows:

# Template for a Spark Job Server configuration file
# When deployed these settings are loaded when job server starts
#
# Spark Cluster / Job Server configuration
spark {
  # spark.master will be passed to each job's JobContext
  master = <spark_master>

  # Default # of CPUs for jobs to use for Spark standalone cluster
  job-number-cpus = 4

  jobserver {
    port = 8090

    context-per-jvm = false
    context-creation-timeout = 100 s
    # Note: JobFileDAO is deprecated from v0.7.0 because of issues in
    # production and will be removed in future, now defaults to H2 file.
    jobdao = spark.jobserver.io.JobSqlDAO

    filedao {
      rootdir = /tmp/spark-jobserver/filedao/data
    }

    datadao {
      rootdir = /tmp/spark-jobserver/upload
    }

    sqldao {
      slick-driver = slick.driver.H2Driver

      jdbc-driver = org.h2.Driver

      rootdir = /tmp/spark-jobserver/sqldao/data

      jdbc {
        url = "jdbc:h2:file:/tmp/spark-jobserver/sqldao/data/h2-db"
        user = ""
        password = ""
      }

      dbcp {
        enabled = false
        maxactive = 20
        maxidle = 10
        initialsize = 10
      }
    }
    result-chunk-size = 1m
    short-timeout = 60 s    
  }

  context-settings {
    num-cpu-cores = 2           # Number of cores to allocate.  Required.
    memory-per-node = 512m         # Executor memory per node, -Xmx style eg 512m, 1G, etc.

  }

}

akka {
  remote.netty.tcp {
    # This controls the maximum message size, including job results, that can be sent
    # maximum-frame-size = 200 MiB
  }
}

# check the reference.conf in spray-can/src/main/resources for all defined settings
spray.can.server.parsing.max-content-length = 250m

I'm using the spark-2.0-preview version.

score 0 · Accepted Answer

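I ran into the same error before, and it was related to the timeout. For a synchronous request (sync=true), you must provide a timeout (in seconds) that matches how long your request takes to process.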

Here is an example of what the request should look like:

curl -k --basic -d '' 'http://localhost:5050/jobs?appName=app&classPath=Main&context=test-context&sync=true&timeout=40'
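The request above carries everything in the query string: appName, classPath, context, sync=true and timeout. A minimal sketch that assembles that URL, so the timeout is an explicit parameter, might look like the following. build_sync_job_url is a hypothetical helper (not part of spark-jobserver), and it assumes the values contain no characters that need URL encoding.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of spark-jobserver): builds the /jobs URL for a
# synchronous submission. Assumes no argument needs URL encoding.
build_sync_job_url() {
  local base_url="$1"    # e.g. http://localhost:8090 (port from the jobserver config)
  local app_name="$2"    # appName the jar was uploaded under
  local class_path="$3"  # fully qualified job class
  local context="$4"     # name of an existing context
  local timeout_s="$5"   # seconds to wait for the synchronous result
  printf '%s/jobs?appName=%s&classPath=%s&context=%s&sync=true&timeout=%s\n' \
    "$base_url" "$app_name" "$class_path" "$context" "$timeout_s"
}

# Print the URL; pass it to curl -d '' to actually submit the job.
build_sync_job_url http://localhost:5050 app Main test-context 40
```

To submit, something like curl -k --basic -d '' "$(build_sync_job_url http://localhost:5050 app Main test-context 40)" would send the same request as the example above.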

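If your request takes longer than 40 seconds, you may also need to modify the application.conf located at

spark-jobserver-master/job-server/src/main/resources/application.conf

and change the following in the spray.can.server section: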

spray.can.server {
  idle-timeout = 210 s
  request-timeout = 200 s
}
