I am trying to read from S3 and write to Elasticsearch from Jupyter, installed on the Spark master machine.
I have this configuration:
import pyspark
import os
#os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell"
import findspark
findspark.init()
from pyspark.sql import SparkSession
import configparser
config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))
aws_profile='DEFAULT'
access_id = config.get(aws_profile, "aws_access_key_id")
access_key = config.get(aws_profile, "aws_secret_access_key")
from pyspark import SparkContext, SparkConf
sc_conf = SparkConf()
sc_conf.setAppName("app-3-logstash")
sc_conf.setMaster('spark://172.31.25.152:7077')
sc_conf.set('spark.executor.memory', '24g')
sc_conf.set('spark.executor.cores', '8')
sc_conf.set('spark.cores.max', '32')
sc_conf.set('spark.logConf', True)
sc_conf.set('spark.packages', 'org.apache.hadoop:hadoop-aws:2.7.3')
sc_conf.set('spark.jars', '/usr/local/spark/jars/elasticsearch-hadoop-7.6.0/dist/elasticsearch-spark-20_2.11-7.6.0.jar')
sc = SparkContext(conf=sc_conf)
hadoop_conf=sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoop_conf.set("fs.s3n.awsAccessKeyId", access_id)
hadoop_conf.set("fs.s3n.awsSecretAccessKey", access_key)
With this configuration I can access ES but not S3: when I try to read from S3, I get this error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
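The read that triggers it is just a plain s3n:// load, essentially this minimal sketch (bucket and key are placeholders):

lines = sc.textFile("s3n://my-bucket/logs/2020-02-01.log")
lines.count()   # any action fails here with the ClassNotFoundException above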
When I disable the sc_conf.set('spark.packages', ...) and sc_conf.set('spark.jars', ...) lines and instead enable the commented-out os.environ['PYSPARK_SUBMIT_ARGS'] line, it can access S3 but not ES.
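Concretely, the variant that reaches S3 (but then loses the ES connector) is roughly:

import os
# load hadoop-aws via spark-submit args instead of spark.packages / spark.jars;
# this has to be set before the SparkContext / JVM gateway is created
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell"
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf
sc_conf = SparkConf()
sc_conf.setAppName("app-3-logstash")
sc_conf.setMaster('spark://172.31.25.152:7077')
# no sc_conf.set('spark.packages', ...) and no sc_conf.set('spark.jars', ...) here
sc = SparkContext(conf=sc_conf)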
What am I missing?
Thanks, Yaniv