I am trying to run GeoSpark on an AWS EMR cluster. The code is:

#  coding=utf-8

from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pyspark.sql.types as t
from geospark.register import GeoSparkRegistrator
from geospark.utils import GeoSparkKryoRegistrator
from geospark.register import upload_jars

import config as cf

import yaml


if __name__ == "__main__":
    # Read files
    with open("/tmp/param.yml", 'r') as ymlfile:
        param = yaml.load(ymlfile, Loader=yaml.SafeLoader)
    
    # Register jars
    upload_jars()

    # Creation of spark session
    print("Creating Spark session")
    spark = SparkSession \
        .builder \
        .getOrCreate()
    
    GeoSparkRegistrator.registerAll(spark)

I get the following error from the upload_jars() function:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/findspark.py", line 143, in init
    py4j = glob(os.path.join(spark_python, "lib", "py4j-*.zip"))[0]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "geo_processing.py", line 21, in <module>
    upload_jars()
  File "/usr/local/lib/python3.7/site-packages/geospark/register/uploading.py", line 39, in upload_jars
    findspark.init()
  File "/usr/local/lib/python3.7/site-packages/findspark.py", line 146, in init
    "Unable to find py4j, your SPARK_HOME may not be configured correctly"
Exception: Unable to find py4j, your SPARK_HOME may not be configured correctly

How can I fix this error?

1 Answer

Solution

You should remove upload_jars() from your code and load the jars in a different way: either copy them into SPARK_HOME (/usr/lib/spark as of emr-4.0.0) as part of an EMR bootstrap action, or use the --jars option of your spark-submit command.
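
For example, here is a minimal sketch of the session creation with upload_jars() removed; the Kryo settings follow the geospark package's documented API, and the jar names in the comment are placeholders for whichever GeoSpark jars you actually ship:

# coding=utf-8
from pyspark.sql import SparkSession
from geospark.register import GeoSparkRegistrator
from geospark.utils import GeoSparkKryoRegistrator, KryoSerializer

if __name__ == "__main__":
    # No upload_jars() call here: the GeoSpark jars are expected to be on the
    # classpath already, e.g. supplied at submit time with something like
    #   spark-submit --jars geospark-1.3.1.jar,geospark-sql_2.3-1.3.1.jar geo_processing.py
    spark = SparkSession \
        .builder \
        .config("spark.serializer", KryoSerializer.getName) \
        .config("spark.kryo.registrator", GeoSparkKryoRegistrator.getName) \
        .getOrCreate()

    GeoSparkRegistrator.registerAll(spark)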

Explanation

I was unable to get the upload_jars() function to work on a multi-node EMR cluster. According to the geospark documentation, upload_jars():

uses the findspark Python package to upload the jar files to the executors and nodes. To avoid copying them every time, the jar files can be placed in the directory SPARK_HOME/jars or in any other path specified in the Spark config files.

Spark is installed on EMR in YARN mode, which means it is only installed on the master node and not on the core/task nodes. Consequently, findspark cannot locate Spark on the core/task nodes, and you get the error Unable to find py4j, your SPARK_HOME may not be configured correctly.
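
If a bootstrap action or --jars is inconvenient, the standard spark.jars Spark property (a comma-separated list of jars to include on the driver and executor classpaths) is another way to point Spark at the files before the context is created. A hedged sketch, with hypothetical jar paths:

from pyspark.sql import SparkSession

# spark.jars is a standard Spark property; the paths below are placeholders
# for wherever the GeoSpark jars actually live on the cluster.
spark = SparkSession \
    .builder \
    .config("spark.jars",
            "/usr/lib/spark/jars/geospark-1.3.1.jar,"
            "/usr/lib/spark/jars/geospark-sql_2.3-1.3.1.jar") \
    .getOrCreate()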

answered 2021-01-13T11:41:21.363