
Our setup is a default data lake on AWS, using S3 for storage and the Glue Catalog as our metastore.

We started using Apache Hudi and were able to get it working by following the AWS documentation. The problem is that, with the configurations and JARs indicated in that documentation, we cannot run spark.sql queries against the Glue metastore.

Here is some relevant information.

We create the cluster with boto3:

import boto3

# EMR client (region and credentials come from the default AWS configuration)
emr = boto3.client('emr')

emr.run_job_flow(
    Name='hudi_debugging',
    LogUri='s3n://mybucket/elasticmapreduce/',
    ReleaseLabel='emr-5.28.0',
    Applications=[
        {
            'Name': 'Spark'
        },
        {
            'Name': 'Hive'
        },
        {
            'Name': 'Hadoop'
        }
    ],
    Instances={
        'InstanceGroups': [
            {
                'Name': 'Core Instance Group',
                'Market': 'SPOT',
                'InstanceCount': 3,
                'EbsConfiguration': {'EbsBlockDeviceConfigs': [
                    {'VolumeSpecification': {'SizeInGB': 16, 'VolumeType': 'gp2'},
                     'VolumesPerInstance': 1}]},
                'InstanceRole': 'CORE',
                'InstanceType': 'm5.xlarge'
            },
            {
                'Name': 'Master Instance Group',
                'Market': 'ON_DEMAND',
                'InstanceCount': 1,
                'EbsConfiguration': {'EbsBlockDeviceConfigs': [
                    {'VolumeSpecification': {'SizeInGB': 16, 'VolumeType': 'gp2'},
                     'VolumesPerInstance': 1}]},
                'InstanceRole': 'MASTER',
                'InstanceType': 'm5.xlarge'
            }
        ],
        'Ec2KeyName': 'dataengineer',
        'Ec2SubnetId': 'mysubnet',
        'EmrManagedMasterSecurityGroup': 'mysg',
        'EmrManagedSlaveSecurityGroup': 'mysg',
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    Configurations=[
        {
            'Classification': 'hive-site',
            'Properties': {
                'hive.metastore.client.factory.class': 'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory'}
        },
        {
            'Classification': 'spark-hive-site',
            'Properties': {
                'hive.metastore.client.factory.class': 'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory'}
        },
        {
            "Classification": "spark-env",
            "Configurations": [
                {
                    "Classification": "export",
                    "Properties": {
                        "PYSPARK_PYTHON": "/usr/bin/python3",
                        "PYSPARK_DRIVER_PYTHON": "/usr/bin/python3",
                        "PYTHONIOENCODING": "utf8"
                    }
                }
            ]
        },
        {
            "Classification": "spark-defaults",
            "Properties": {
                "spark.sql.execution.arrow.enabled": "true"
            }
        },
        {
            "Classification": "spark",
            "Properties": {
                "maximizeResourceAllocation": "true"
            }
        }
    ],
    BootstrapActions=[
        {
            'Name': 'Bootstrap action',
            'ScriptBootstrapAction': {
                'Path': 's3://mybucket/bootstrap_emr.sh',
                'Args': []
            }
        },
    ],
    Steps=[
        {
            'Name': 'Setup Hadoop Debugging',
            'ActionOnFailure': 'TERMINATE_CLUSTER',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': ['state-pusher-script']
            }
        }
    ],
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
    ScaleDownBehavior='TERMINATE_AT_TASK_COMPLETION',
    VisibleToAllUsers=True
)

We launch the pyspark shell using the example from the documentation mentioned above:

pyspark \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.sql.hive.convertMetastoreParquet=false" \
--jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar

Then, inside the shell, when we run spark.sql("show tables") we get the following error:

Using Python version 3.7.9 (default, Aug 27 2020 21:59:41)
SparkSession available as 'spark'.
>>> spark.sql("show tables").show()
21/04/09 19:49:21 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o67.sql.
: org.apache.spark.sql.AnalysisException: java.lang.NoSuchMethodError: org.apache.http.conn.ssl.SSLConnectionSocketFactory.<init>(Ljavax/net/ssl/SSLContext;Ljavax/net/ssl/HostnameVerifier;)V;
        at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:128)
        at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:238)
        at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:117)
        ...
    ... 44 more


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 767, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'java.lang.NoSuchMethodError: org.apache.http.conn.ssl.SSLConnectionSocketFactory.<init>(Ljavax/net/ssl/SSLContext;Ljavax/net/ssl/HostnameVerifier;)V;'

We also tried submitting it as a step, in both client and cluster deploy-mode, and got a similar result (a sketch of how we submit such a step follows the traceback below):

Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o66.sql.
: org.apache.spark.sql.AnalysisException: java.lang.NoSuchMethodError: org.apache.http.conn.ssl.SSLConnectionSocketFactory.<init>(Ljavax/net/ssl/SSLContext;Ljavax/net/ssl/HostnameVerifier;)V;
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:128)
    at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:238)
    at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:117)
    ...
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:239)
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
    ... 97 more


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/tmp/spark-cd3c9851-5edd-4c6f-abf8-2f4a0af807e3/spark_sql_test.py", line 7, in <module>
    _ = spark.sql("select * from dl_raw.pagnet_clients limit 10")
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 767, in sql
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: 'java.lang.NoSuchMethodError: org.apache.http.conn.ssl.SSLConnectionSocketFactory.<init>(Ljavax/net/ssl/SSLContext;Ljavax/net/ssl/HostnameVerifier;)V;'
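
For reference, the step is submitted roughly as sketched below via boto3's add_job_flow_steps; the cluster id and script path are placeholders rather than our real values, and the spark-submit flags mirror the shell invocation shown earlier.

import boto3

emr = boto3.client('emr')

# Hypothetical identifiers, for illustration only
cluster_id = 'j-XXXXXXXXXXXXX'
script = 's3://mybucket/scripts/spark_sql_test.py'

emr.add_job_flow_steps(
    JobFlowId=cluster_id,
    Steps=[
        {
            'Name': 'spark_sql_test',
            'ActionOnFailure': 'CONTINUE',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': [
                    'spark-submit',
                    '--deploy-mode', 'cluster',  # we also tried 'client'
                    '--conf', 'spark.serializer=org.apache.spark.serializer.KryoSerializer',
                    '--conf', 'spark.sql.hive.convertMetastoreParquet=false',
                    '--jars', '/usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar',
                    script
                ]
            }
        }
    ]
)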

2 Answers


spark.sql.hive.convertMetastoreParquet needs to be set to true.
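
For example, taking the launch command from the question and only flipping that flag:

pyspark \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.sql.hive.convertMetastoreParquet=true" \
--jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar

It can also be set inside an already running session:

spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")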

Answered on 2021-04-16T22:29:09.143

Please open an issue at github.com/apache/hudi/issues to get help from the Hudi community.

Answered on 2021-04-12T11:46:42.843