python - How to run a PySpark job (with custom modules) on Amazon EMR?

Question

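I want to run a PySpark program that runs fine on my (local) machine.

I have an Amazon Elastic Map Reduce cluster running, with all the necessary dependencies installed (Spark, Python modules from PyPI).

Now, how do I run a PySpark job that uses some custom modules? I have tried many things over roughly half a day, to no avail. The best command I have found so far is: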

/home/hadoop/spark/bin/spark-submit --master yarn-cluster \
    --py-files s3://bucket/custom_module.py s3://bucket/pyspark_program.py

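However, Python fails because it cannot find custom_module.py. It does seem to try to copy it, though:

INFO yarn.Client: Uploading resource s3://bucket/custom_module.py -> hdfs://...:9000/user/hadoop/.sparkStaging/application_..._0001/custom_module.py

INFO s3n.S3NativeFileSystem: Opening 's3://bucket/custom_module.py' for reading

This looks like a very basic question, but the web is quite silent on it, including the official documentation (the Spark documentation seems to imply the command above).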

score 0 · Accepted Answer

This is a bug in Spark 1.3.0.

The workaround is to define SPARK_HOME for YARN, even though this should be unnecessary:

spark-submit … --conf spark.yarn.appMasterEnv.SPARK_HOME=/home/hadoop/spark \
               --conf spark.executorEnv.SPARK_HOME=/home/hadoop/spark …
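
To see how the workaround fits together with the command from the question, here is a minimal Python sketch that only assembles and prints the full spark-submit invocation. The helper name build_spark_submit is hypothetical, and filling the elided parts of the answer's command with the question's options (--master yarn-cluster, --py-files) is an assumption, not something the answer states. It relies on shlex.join, so it assumes Python 3.8 or later.

```python
import shlex


def build_spark_submit(spark_home, py_files, app, master="yarn-cluster"):
    """Return a spark-submit argument list with the SPARK_HOME workaround.

    Hypothetical helper for illustration; it does not run anything.
    """
    return [
        f"{spark_home}/bin/spark-submit",
        "--master", master,
        # Workaround from the answer: expose SPARK_HOME to the YARN
        # application master and to the executors.
        "--conf", f"spark.yarn.appMasterEnv.SPARK_HOME={spark_home}",
        "--conf", f"spark.executorEnv.SPARK_HOME={spark_home}",
        # --py-files takes a comma-separated list of files.
        "--py-files", ",".join(py_files),
        app,
    ]


# Paths and bucket names are the placeholders used in the question.
cmd = build_spark_submit(
    "/home/hadoop/spark",
    ["s3://bucket/custom_module.py"],
    "s3://bucket/pyspark_program.py",
)

# Print a shell-quoted command line that can be reviewed before running it.
print(shlex.join(cmd))
```

Printing the command instead of executing it keeps the sketch safe to run anywhere; on the cluster, the same list could be handed to subprocess.run(cmd, check=True).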
