Our setup is a default Data Lake on AWS, using S3 for storage and the Glue Catalog as our metastore.
We started using Apache Hudi and were able to get it working by following the AWS documentation. The problem is that, when using the configuration and JARs indicated in the docs, we cannot run spark.sql against the Glue metastore.
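For context, plain Hudi writes to S3 do succeed. The following is a minimal sketch of the kind of write that works, run in the same pyspark shell; the table, field, and bucket names here are placeholders, not our actual job:

from pyspark.sql import Row

# Toy DataFrame: just enough columns for the Hudi key/precombine/partition options below
df = spark.createDataFrame([Row(id=1, ts=1, part="a")])

df.write.format("org.apache.hudi") \
    .option("hoodie.table.name", "hudi_debugging_table") \
    .option("hoodie.datasource.write.recordkey.field", "id") \
    .option("hoodie.datasource.write.precombine.field", "ts") \
    .option("hoodie.datasource.write.partitionpath.field", "part") \
    .mode("overwrite") \
    .save("s3://mybucket/hudi/hudi_debugging_table/")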
Here is some information.

We create the cluster using boto3:
import boto3

# EMR client; credentials and region come from the environment
emr = boto3.client('emr')

response = emr.run_job_flow(
    Name='hudi_debugging',
    LogUri='s3n://mybucket/elasticmapreduce/',
    ReleaseLabel='emr-5.28.0',
    Applications=[
        {
            'Name': 'Spark'
        },
        {
            'Name': 'Hive'
        },
        {
            'Name': 'Hadoop'
        }
    ],
    Instances={
        'InstanceGroups': [
            {
                'Name': 'Core Instance Group',
                'Market': 'SPOT',
                'InstanceCount': 3,
                'EbsConfiguration': {'EbsBlockDeviceConfigs': [
                    {'VolumeSpecification': {'SizeInGB': 16, 'VolumeType': 'gp2'},
                     'VolumesPerInstance': 1}]},
                'InstanceRole': 'CORE',
                'InstanceType': 'm5.xlarge'
            },
            {
                'Name': 'Master Instance Group',
                'Market': 'ON_DEMAND',
                'InstanceCount': 1,
                'EbsConfiguration': {'EbsBlockDeviceConfigs': [
                    {'VolumeSpecification': {'SizeInGB': 16, 'VolumeType': 'gp2'},
                     'VolumesPerInstance': 1}]},
                'InstanceRole': 'MASTER',
                'InstanceType': 'm5.xlarge'
            }
        ],
        'Ec2KeyName': 'dataengineer',
        'Ec2SubnetId': 'mysubnet',
        'EmrManagedMasterSecurityGroup': 'mysg',
        'EmrManagedSlaveSecurityGroup': 'mysg',
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    Configurations=[
        {
            'Classification': 'hive-site',
            'Properties': {
                'hive.metastore.client.factory.class': 'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory'}
        },
        {
            'Classification': 'spark-hive-site',
            'Properties': {
                'hive.metastore.client.factory.class': 'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory'}
        },
        {
            "Classification": "spark-env",
            "Configurations": [
                {
                    "Classification": "export",
                    "Properties": {
                        "PYSPARK_PYTHON": "/usr/bin/python3",
                        "PYSPARK_DRIVER_PYTHON": "/usr/bin/python3",
                        "PYTHONIOENCODING": "utf8"
                    }
                }
            ]
        },
        {
            "Classification": "spark-defaults",
            "Properties": {
                "spark.sql.execution.arrow.enabled": "true"
            }
        },
        {
            "Classification": "spark",
            "Properties": {
                "maximizeResourceAllocation": "true"
            }
        }
    ],
    BootstrapActions=[
        {
            'Name': 'Bootstrap action',
            'ScriptBootstrapAction': {
                'Path': 's3://mybucket/bootstrap_emr.sh',
                'Args': []
            }
        },
    ],
    Steps=[
        {
            'Name': 'Setup Hadoop Debugging',
            'ActionOnFailure': 'TERMINATE_CLUSTER',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': ['state-pusher-script']
            }
        }
    ],
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
    ScaleDownBehavior='TERMINATE_AT_TASK_COMPLETION',
    VisibleToAllUsers=True
)
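run_job_flow returns immediately rather than waiting for the cluster, so before opening a shell or adding steps we wait for it to come up, roughly like this (a sketch, not our exact code; it uses the response captured above):

cluster_id = response['JobFlowId']

# Block until the cluster reaches WAITING (it stays alive because
# KeepJobFlowAliveWhenNoSteps=True)
emr.get_waiter('cluster_running').wait(ClusterId=cluster_id)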
We launch the pyspark shell using the example from the documentation mentioned above:
pyspark \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.sql.hive.convertMetastoreParquet=false" \
--jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar
Then, inside the shell, when we run spark.sql("show tables").show() we get the following error:
Using Python version 3.7.9 (default, Aug 27 2020 21:59:41)
SparkSession available as 'spark'.
>>> spark.sql("show tables").show()
21/04/09 19:49:21 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o67.sql.
: org.apache.spark.sql.AnalysisException: java.lang.NoSuchMethodError: org.apache.http.conn.ssl.SSLConnectionSocketFactory.<init>(Ljavax/net/ssl/SSLContext;Ljavax/net/ssl/HostnameVerifier;)V;
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:128)
    at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:238)
    at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:117)
    ...
    ... 44 more
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 767, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'java.lang.NoSuchMethodError: org.apache.http.conn.ssl.SSLConnectionSocketFactory.<init>(Ljavax/net/ssl/SSLContext;Ljavax/net/ssl/HostnameVerifier;)V;'
We also tried submitting it as a step, in both client and cluster deploy-mode, and got a similar result. The submission looked roughly like the sketch below.
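A sketch of the step submission (cluster_id comes from the run_job_flow response above; the S3 path of the script is a placeholder, since the traceback only shows its on-cluster staging path):

emr.add_job_flow_steps(
    JobFlowId=cluster_id,
    Steps=[
        {
            'Name': 'spark_sql_test',
            'ActionOnFailure': 'CONTINUE',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': [
                    'spark-submit',
                    '--deploy-mode', 'client',  # also tried 'cluster'
                    '--conf', 'spark.serializer=org.apache.spark.serializer.KryoSerializer',
                    '--conf', 'spark.sql.hive.convertMetastoreParquet=false',
                    '--jars', '/usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar',
                    's3://mybucket/spark_sql_test.py'
                ]
            }
        }
    ]
)

The step then fails with: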
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o66.sql.
: org.apache.spark.sql.AnalysisException: java.lang.NoSuchMethodError: org.apache.http.conn.ssl.SSLConnectionSocketFactory.<init>(Ljavax/net/ssl/SSLContext;Ljavax/net/ssl/HostnameVerifier;)V;
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:128)
    at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:238)
    at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:117)
    ...
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:239)
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
    ... 97 more
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/mnt/tmp/spark-cd3c9851-5edd-4c6f-abf8-2f4a0af807e3/spark_sql_test.py", line 7, in <module>
    _ = spark.sql("select * from dl_raw.pagnet_clients limit 10")
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 767, in sql
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: 'java.lang.NoSuchMethodError: org.apache.http.conn.ssl.SSLConnectionSocketFactory.<init>(Ljavax/net/ssl/SSLContext;Ljavax/net/ssl/HostnameVerifier;)V;'
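For reference, spark_sql_test.py is essentially just the query above wrapped in a session. A minimal reconstruction (only the failing line, the spark.sql call, is certain from the traceback):

from pyspark.sql import SparkSession

# Hive support is needed so spark.sql goes through the (Glue-backed) metastore
spark = SparkSession.builder \
    .appName("spark_sql_test") \
    .enableHiveSupport() \
    .getOrCreate()

_ = spark.sql("select * from dl_raw.pagnet_clients limit 10")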