authentication - Locally reading S3 files through Spark (or better: pyspark)

Question

score 9 · Accepted Answer

Yes, you have to use s3n instead of s3. In Hadoop, the s3 scheme is a block-based filesystem layered on top of S3, so it does not read ordinary S3 objects, and its benefits are unclear to me.

You can pass the credentials to the sc.hadoopFile or sc.newAPIHadoopFile calls:

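rdd = sc.hadoopFile('s3n://my_bucket/my_file',
                    'org.apache.hadoop.mapred.TextInputFormat',  # input format
                    'org.apache.hadoop.io.LongWritable',         # key class
                    'org.apache.hadoop.io.Text',                 # value class
                    conf={'fs.s3n.awsAccessKeyId': '...',
                          'fs.s3n.awsSecretAccessKey': '...'})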

score 3 · Accepted Answer

The problem was actually a bug in Amazon's boto Python module, caused by the MacPorts version of boto being old. Installing boto through pip solved it: ~/.aws/credentials was then read correctly.

Now that I have more experience, I would say that in general (as of the end of 2015) Amazon Web Services tools and Spark/PySpark have patchy documentation and can have some serious bugs that are very easy to run into. For the first problem, I would recommend first updating the aws command line interface, boto and Spark every time something strange happens: this has "magically" solved a few issues for me already.

score 3 · Accepted Answer

Here is a solution on how to read the credentials from ~/.aws/credentials. It makes use of the fact that the credentials file is an INI file which can be parsed with Python's configparser.

import os
import configparser

config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))

aws_profile = 'default' # your AWS profile to use

access_id = config.get(aws_profile, "aws_access_key_id") 
access_key = config.get(aws_profile, "aws_secret_access_key")

See also my gist at https://gist.github.com/asmaier/5768c7cda3620901440a62248614bbd0 .
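As a rough sketch of how the parsed values could be handed to Spark, the helper below wraps the parsing above in a function and returns a dict in the shape taken by the conf argument of sc.hadoopFile. The name load_s3n_conf and its parameters are made up for illustration; the fs.s3n.* keys are the ones used in the other answers here.

```python
import configparser
import os


def load_s3n_conf(path="~/.aws/credentials", profile="default"):
    """Return s3n credential settings read from an AWS credentials file.

    Hypothetical helper: the name and signature are illustrative only.
    """
    config = configparser.ConfigParser()
    # config.read() silently skips missing files, so check its result.
    if not config.read(os.path.expanduser(path)):
        raise FileNotFoundError("cannot read credentials file: " + path)
    return {
        "fs.s3n.awsAccessKeyId": config.get(profile, "aws_access_key_id"),
        "fs.s3n.awsSecretAccessKey": config.get(profile, "aws_secret_access_key"),
    }
```

The result could then be passed as conf=load_s3n_conf() to sc.hadoopFile or sc.newAPIHadoopFile, assuming a [default] profile exists in the file.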

score 1 · Accepted Answer

Setting environment variables could help.

The Spark FAQ, under the question "How can I access data in S3?", suggests setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
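If the variables are set but a particular read still does not pick them up, one option is to copy them into a conf dict by hand. A minimal sketch, assuming the same fs.s3n.* keys used in the other answers; the helper name s3n_conf_from_env is hypothetical:

```python
import os


def s3n_conf_from_env(environ=os.environ):
    """Build s3n credential settings from the standard AWS variables.

    Hypothetical helper, not part of Spark or boto.
    """
    try:
        return {
            "fs.s3n.awsAccessKeyId": environ["AWS_ACCESS_KEY_ID"],
            "fs.s3n.awsSecretAccessKey": environ["AWS_SECRET_ACCESS_KEY"],
        }
    except KeyError as missing:
        # Fail early with the name of the variable that is not set.
        raise RuntimeError("environment variable not set: " + missing.args[0])
```

Failing early with the variable name tends to be easier to debug than the authentication error S3 returns later.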

score 0 · Accepted Answer

I cannot say much about the Java objects you have to give to the hadoopFile function, only that this function reads through the old Hadoop API and seems to have a newer counterpart in newAPIHadoopFile. The documentation on this is quite sketchy, and I feel like you need to know Scala/Java to really get to the bottom of what everything means.

In the meantime, I figured out how to actually get some S3 data into pyspark and I thought I would share my findings. The Spark API documentation says that conf is a dict that gets converted into a Java configuration (XML). I found the corresponding Java configuration, which should reflect the values to put into the dict, in "How to access S3/S3n from local hadoop installation".

bucket = "mycompany-mydata-bucket"
prefix = "2015/04/04/mybiglogfile.log.gz"
filename = "s3n://{}/{}".format(bucket, prefix)

config_dict = {"fs.s3n.awsAccessKeyId":"FOOBAR",
               "fs.s3n.awsSecretAccessKey":"BARFOO"}

rdd = sc.hadoopFile(filename,
                    'org.apache.hadoop.mapred.TextInputFormat',
                    'org.apache.hadoop.io.LongWritable',  # key: byte offset of the line
                    'org.apache.hadoop.io.Text',          # value: the line itself
                    conf=config_dict)

This code snippet loads the file from the bucket and prefix (file path in the bucket) specified on the first two lines.
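For comparison, the same read can be sketched with newAPIHadoopFile, which takes its input format from the newer org.apache.hadoop.mapreduce package. The wrapper name read_s3n_lines is made up here, and sc is assumed to be an existing SparkContext. Each record comes back as a (byte offset, line) pair, so values() keeps only the lines.

```python
def read_s3n_lines(sc, bucket, key, conf):
    """Return an RDD of text lines from s3n://bucket/key.

    Hypothetical wrapper; sc is assumed to be a live SparkContext and
    conf a dict holding the fs.s3n.* credential settings.
    """
    path = "s3n://{}/{}".format(bucket, key)
    pairs = sc.newAPIHadoopFile(
        path,
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",  # key: byte offset of the line
        "org.apache.hadoop.io.Text",          # value: the line itself
        conf=conf)
    # Drop the offsets and keep only the text of each line.
    return pairs.values()
```

With the names from the snippet above, a call would look like read_s3n_lines(sc, bucket, prefix, config_dict).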
