
My current setup:

  • Spark EC2 cluster with HDFS and YARN
  • JupyterHub (0.7.0)
  • PySpark kernel with python27

The very simple code I am using for this question:

rdd = sc.parallelize([1, 2])
rdd.collect()

The PySpark kernel, which works as expected in Spark standalone mode, has the following environment variable in its kernel json file:

"PYSPARK_SUBMIT_ARGS": "--master spark://<spark_master>:7077 pyspark-shell"

However, when I try to run in yarn-client mode it hangs forever, and the log output in the JupyterHub logs is:

16/12/12 16:45:21 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:45:36 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:45:51 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:46:06 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

As described here, I added the HADOOP_CONF_DIR environment variable pointing to the directory containing the Hadoop configuration, and changed the --master property in PYSPARK_SUBMIT_ARGS to "yarn-client". I can also confirm that no other jobs are running during this time and that the workers are correctly registered.
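
Concretely, after that change the relevant part of the kernel env looks roughly like this (the HADOOP_CONF_DIR value below is a placeholder for wherever the Hadoop configuration actually lives):

"env": {
  "HADOOP_CONF_DIR": "/path/to/hadoop/conf",
  "PYSPARK_SUBMIT_ARGS": "--master yarn-client pyspark-shell"
}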

My impression is that it is possible to configure a JupyterHub notebook with a PySpark kernel to run against YARN, as other people have done. If that is indeed the case, what am I doing wrong?


2 Answers


To get your pyspark working in yarn mode, you have to do some additional configuration:

  1. Configure yarn for a remote yarn connection by copying the hadoop-yarn-server-web-proxy-<version>.jar of your yarn cluster into <local hadoop directory>/hadoop-<version>/share/hadoop/yarn/ on your jupyter instance (you need a local hadoop installation)

  2. Copy the hive-site.xml of your cluster into <local spark directory>/spark-<version>/conf/

  3. Copy the yarn-site.xml of your cluster into <local hadoop directory>/hadoop-<version>/etc/hadoop/

  4. Set the environment variables:

    • export HADOOP_HOME=<local hadoop directory>/hadoop-<version>
    • export SPARK_HOME=<local spark directory>/spark-<version>
    • export HADOOP_CONF_DIR=<local hadoop directory>/hadoop-<version>/etc/hadoop
    • export YARN_CONF_DIR=<local hadoop directory>/hadoop-<version>/etc/hadoop
  5. Now you can create your kernel: vim /usr/local/share/jupyter/kernels/pyspark/kernel.json

     {
       "display_name": "pySpark (Spark 2.1.0)",
       "language": "python",
       "argv": [
         "/opt/conda/envs/python35/bin/python",
         "-m",
         "ipykernel",
         "-f",
         "{connection_file}"
       ],
       "env": {
         "PYSPARK_PYTHON": "/opt/conda/envs/python35/bin/python",
         "SPARK_HOME": "/opt/mapr/spark/spark-2.1.0",
         "PYTHONPATH": "/opt/mapr/spark/spark-2.1.0/python/lib/py4j-0.10.4-src.zip:/opt/mapr/spark/spark-2.1.0/python/",
         "PYTHONSTARTUP": "/opt/mapr/spark/spark-2.1.0/python/pyspark/shell.py",
         "PYSPARK_SUBMIT_ARGS": "--master yarn pyspark-shell"
       }
     }

  6. Restart your jupyterhub and you should see pyspark. The root user usually does not have yarn permission because of uid=1; you should connect to jupyterhub as another user. A quick way to check the result is sketched below.
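
As a minimal sanity check, assuming the kernel from step 5 is used (so pyspark/shell.py has already created sc via PYTHONSTARTUP), the first notebook cell can confirm that the application is registered with YARN instead of hanging:

# sc already exists in the notebook because pyspark/shell.py creates it at startup
print(sc.master)                         # expected: "yarn"
print(sc.applicationId)                  # a YARN id such as application_<timestamp>_<n>
print(sc.parallelize([1, 2]).collect())  # returns [1, 2] once executors are allocated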

answered 2017-08-15T20:20:43.513

Hope my case helps you.

I configured the URL simply by passing it as an argument:

import findspark
findspark.init()  # locate the local Spark installation and put pyspark on sys.path
from pyspark import SparkContext
sc = SparkContext("yarn-client", "First App")
answered 2018-05-02T01:44:19.563