My current setup:
- Spark EC2 cluster with HDFS and YARN
- JupyterHub (0.7.0)
- PySpark kernel with python27
The very simple code I am using for this question:
rdd = sc.parallelize([1, 2])
rdd.collect()
The PySpark kernel works as expected in Spark standalone mode; its kernel json file contains the following environment variable:
"PYSPARK_SUBMIT_ARGS": "--master spark://<spark_master>:7077 pyspark-shell"
However, when I try to run in yarn-client mode, it hangs forever, and the log output in the JupyterHub logs is:
16/12/12 16:45:21 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:45:36 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:45:51 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:46:06 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
As described here, I added the HADOOP_CONF_DIR environment variable pointing to the directory containing the Hadoop configuration, and changed the --master property in PYSPARK_SUBMIT_ARGS to yarn-client. Also, I can confirm that no other jobs were running during this time and that the workers are correctly registered.
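For reference, after those changes the relevant part of my kernel json looks roughly like this (the interpreter path and the Hadoop config directory are illustrative placeholders, not my exact values):

```json
{
  "display_name": "PySpark (YARN)",
  "language": "python",
  "argv": ["python2.7", "-m", "ipykernel", "-f", "{connection_file}"],
  "env": {
    "HADOOP_CONF_DIR": "/etc/hadoop/conf",
    "PYSPARK_SUBMIT_ARGS": "--master yarn-client pyspark-shell"
  }
}
```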
My impression was that it is possible to configure a JupyterHub notebook with a PySpark kernel to run against YARN, as others have done. If that is indeed the case, what am I doing wrong?