
The problem is that you need to launch a separate JVM to create a separate session with a different amount of RAM per job.

How can I submit a few Spark applications simultaneously without manually spawning separate JVMs?

My app runs on a single server, within a single JVM. That clashes with Spark's session-per-JVM paradigm, which says:

1 JVM => 1 app => 1 session => 1 context => 1 RAM/executors/cores config

I'd like to have different configurations per Spark application without launching extra JVMs manually. The configurations in question (see the sketch after this list):

  1. spark.executor.cores
  2. spark.executor.memory
  3. spark.dynamicAllocation.maxExecutors
  4. spark.default.parallelism
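
For context, a minimal sketch (the app name and values are made up) of how these properties are set: they go on the SparkSession builder and are fixed for the lifetime of the session, i.e. of the JVM.

```scala
import org.apache.spark.sql.SparkSession

// All of these properties are fixed when the session (and its SparkContext)
// is created; they cannot be changed later for the lifetime of this JVM.
val spark = SparkSession.builder()
  .appName("long-running-app")                         // placeholder app name
  .config("spark.executor.cores", "2")
  .config("spark.executor.memory", "28g")
  .config("spark.dynamicAllocation.maxExecutors", "10")
  .config("spark.default.parallelism", "200")
  .getOrCreate()

// Every job submitted from this session shares the settings above.
```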

Use case

You have started a long-running job, say one that takes 4-5 hours to complete. The job runs in a session with spark.executor.memory=28GB and spark.executor.cores=2. Now you want to launch a 5-10 second job on user demand, without waiting 4-5 hours. This tiny job needs 1GB of RAM. What would you do? Submit the tiny job on behalf of the long-running job's session? Then it will claim 28GB.
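
A minimal sketch of why this happens (names are made up): within one JVM, a second SparkSession.builder().getOrCreate() returns the already-running session, so the tiny job inherits the 28GB executor configuration; static settings such as spark.executor.memory cannot be replaced once the context exists.

```scala
import org.apache.spark.sql.SparkSession

// Session created for the long-running job.
val bigSession = SparkSession.builder()
  .appName("long-running-job")
  .config("spark.executor.memory", "28g")
  .config("spark.executor.cores", "2")
  .getOrCreate()

// Later, in the same JVM, we ask for a "small" session for the tiny job...
val tinySession = SparkSession.builder()
  .appName("tiny-job")
  .config("spark.executor.memory", "1g") // static setting: the running context keeps 28g
  .getOrCreate()

// ...but we get back the very same session and context with the 28g executors.
assert(tinySession eq bigSession)
```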

What I've found

  1. Spark allows you to configure the number of cores and executors only at the session level. Spark's scheduler pools let you slice and dice only the number of cores, not RAM or executors, right? (See the pool sketch after this list.)
  2. Spark Job Server. But it doesn't support Spark newer than 2.0, so it's not an option for me, even though it actually solves the problem for versions older than 2.0. The Spark JobServer feature list says "Separate JVM per SparkContext for isolation (EXPERIMENTAL)", which means spawning a new JVM per context.
  3. Mesos fine-grained mode is deprecated.
  4. This hack, but it's too risky to use in production.
  5. The hidden Apache Spark REST API for job submission; read this and this. There is definitely a way to specify executor memory and cores there, but what is the behavior when submitting two jobs with different configs? As I understand it, this is a Java REST client for it.
  6. Livy. I'm not familiar with it, but it looks like they have a Java API only for batch submission, which is not an option for me.
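
Regarding item 1, a minimal sketch of what the fair-scheduler pools do (the pool and app names are made up): the pool is chosen per thread, and pool definitions in fairscheduler.xml only control weight, minShare, and scheduling mode over the session's existing cores, never executor memory or executor count.

```scala
import org.apache.spark.sql.SparkSession

// Assumes spark.scheduler.mode=FAIR and a pool named "tiny" defined in fairscheduler.xml.
val spark = SparkSession.builder()
  .appName("shared-session")                 // placeholder app name
  .config("spark.scheduler.mode", "FAIR")
  .getOrCreate()
val sc = spark.sparkContext

// Pools are selected per thread; jobs triggered from this thread go to the "tiny" pool.
sc.setLocalProperty("spark.scheduler.pool", "tiny")

// The job still runs on the session's existing executors: the pool only decides
// how task slots (cores) are shared, not executor memory or executor count.
spark.range(1000L).count()

// Unset the pool for this thread when done.
sc.setLocalProperty("spark.scheduler.pool", null)
```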

2 Answers


With a use case, this is much clearer now. There are two possible solutions:

If you need to share data between those jobs, use the FAIR scheduler and a (REST-) frontend (as SparkJobServer, Livy, etc. do). You don't need to use SparkJobServer either; if you have a fixed scope, it should be relatively easy to code. I've seen projects go in that direction. All you need is an event loop and a way to translate your incoming queries into Spark queries. In a way, I would hope there were a library to cover this use case, since it's pretty much always the first thing you have to build when you work on a Spark-based application/framework. In this case, you can size your executors according to your hardware, and Spark will manage the scheduling of your jobs. With Yarn's dynamic resource allocation, Yarn will also free resources (kill executors) if your framework/application is idle. For more information, read here: http://spark.apache.org/docs/latest/job-scheduling.html
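
A minimal sketch of the "event loop plus one shared session" idea described above (the MiniJobServer object, pool names, and paths are made up): incoming requests are handled on separate threads, each pinned to its own FAIR pool, so short jobs are not starved by the long-running one.

```scala
import java.util.concurrent.Executors
import org.apache.spark.sql.SparkSession
import scala.concurrent.{ExecutionContext, Future}

object MiniJobServer {
  // One shared session, sized for the whole machine; FAIR scheduling enabled.
  private val spark = SparkSession.builder()
    .appName("mini-job-server")
    .config("spark.scheduler.mode", "FAIR")
    .getOrCreate()

  // The "event loop": a thread pool that accepts incoming job requests.
  private implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

  /** Run a job in a named FAIR pool; pools only divide cores, not memory. */
  def submit[T](pool: String)(job: SparkSession => T): Future[T] = Future {
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool)
    try job(spark)
    finally spark.sparkContext.setLocalProperty("spark.scheduler.pool", null)
  }
}

// Usage: the 4-5 hour job and the 5-10 second job can run concurrently.
// MiniJobServer.submit("batch") { s => s.read.parquet("/big/data").count() }
// MiniJobServer.submit("interactive") { s => s.range(1000L).count() }
```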

If you don't need to share data, use YARN (or another resource manager) to assign resources to both jobs in a fair manner. YARN has a fair scheduling mode, and you can set the resource demands per application. If you think this suits you but you do need to share data, then you might want to think about using Hive or Alluxio to provide a data interface. In this case you would run two spark-submits and maintain multiple drivers in the cluster. Building additional automation around spark-submit can help make this less annoying and more transparent to end users. This approach is also the highest-latency one, since resource allocation and SparkSession initialization take up a more or less constant amount of time.
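
One way to build that automation around spark-submit is Spark's SparkLauncher API, which spawns a spark-submit (and therefore a separate driver JVM with its own configuration) programmatically; the jar path, main class, and master below are placeholders, and SPARK_HOME is assumed to be set in the environment.

```scala
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

// Launches the tiny job as its own Spark application with its own 1g executors,
// independent of the long-running 28g application. Assumes SPARK_HOME is set.
val handle: SparkAppHandle = new SparkLauncher()
  .setAppResource("/path/to/tiny-job.jar")   // placeholder jar
  .setMainClass("com.example.TinyJob")       // placeholder main class
  .setMaster("yarn")
  .setDeployMode("cluster")
  .setConf(SparkLauncher.EXECUTOR_MEMORY, "1g")
  .setConf(SparkLauncher.EXECUTOR_CORES, "1")
  .startApplication()

// The handle reports state transitions (SUBMITTED, RUNNING, FINISHED, ...).
println(handle.getState)
```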

answered 2017-05-16T14:48:34.510

tl;dr I'd say it's not possible.

A Spark application is at the very least a single JVM, and it's at spark-submit time that you specify the requirements of that single JVM (or the bunch of JVMs that act as executors).

If, however, you'd like to have different JVM configurations without launching separate JVMs, that does not seem possible (even outside Spark, as long as JVMs are in use).

answered 2017-05-16T13:56:27.440