
I am an AWS newbie. I created a cluster and SSH'ed into the master node. When I try to copy files from s3://my-bucket-name/ to the local file://home/hadoop folder in Pig using:

cp s3://my-bucket-name/path/to/file file://home/hadoop

I get the error:

2013-06-08 18:59:00,267 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).

I cannot even ls my S3 bucket. I set AWS_ACCESS_KEY and AWS_SECRET_KEY without success, and I could not locate a config file for Pig where I could set the appropriate fields.

Any help please?

Edit: I tried to load a file in Pig using the full s3n:// URI:

grunt> raw_logs = LOAD 's3://XXXXX/input/access_log_1' USING TextLoader as (line:chararray);
grunt> illustrate raw_logs;

and I get the following error:

2013-06-08 19:28:33,342 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2013-06-08 19:28:33,404 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2013-06-08 19:28:33,404 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2013-06-08 19:28:33,405 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2013-06-08 19:28:33,405 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2013-06-08 19:28:33,429 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2013-06-08 19:28:33,430 [main] ERROR org.apache.pig.pen.ExampleGenerator - Error reading data. Internal error creating job configuration.
java.lang.RuntimeException: Internal error creating job configuration.
        at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:160)
        at org.apache.pig.PigServer.getExamples(PigServer.java:1244)
        at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:722)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.Illustrate(PigScriptParser.java:591)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:306)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
        at org.apache.pig.Main.run(Main.java:500)
        at org.apache.pig.Main.main(Main.java:114)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
2013-06-08 19:28:33,432 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Encountered IOException. Exception : Internal error creating job configuration.
Details at logfile: /home/hadoop/pig_1370719069857.log


3 Answers


First off, you should use the s3n protocol (unless you stored the files on S3 using the s3 protocol) - s3 is used for block storage (i.e. similar to HDFS, only on S3) and s3n is the native S3 file system (i.e. you get what you see there).

You can use distcp or a simple Pig LOAD from s3n. You can either supply the access key & secret in hadoop-site.xml as specified in the exception you got (see here for more info: http://wiki.apache.org/hadoop/AmazonS3), or you can add them to the URI:

raw_logs = LOAD 's3n://access:secret@XXXXX/input/access_log_1' USING TextLoader AS (line:chararray);

Make sure that your secret doesn't contain back-slashes - otherwise it won't work.
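For the hadoop-site.xml route, a minimal sketch of the relevant properties might look like this (the values are placeholders; the fs.s3n.* names are the s3n counterparts of the fs.s3.* properties named in the error message):

<!-- hadoop-site.xml: placeholder S3 credentials for the s3n filesystem -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>

Restarting the Grunt shell after editing the file should be enough for Pig to pick the properties up.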

answered 2013-06-09T06:38:39.980

The cp in

cp s3://my-bucket-name/path/to/file file://home/hadoop

is unaware of S3.

You may want to use:

s3cmd get s3://some-s3-bucket/some-s3-folder/local_file.ext ~/local_dir/ 

I'm not sure why s3cmd cp ... does not do what it needs to do, but s3cmd get ... works. And man s3cmd has:

   s3cmd get s3://BUCKET/OBJECT LOCAL_FILE
          Get file from bucket
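
If s3cmd is not available on the node, the Hadoop client itself can also pull an object down to local disk once the S3 credentials are configured - a rough sketch with placeholder bucket and paths:

hadoop fs -copyToLocal s3n://my-bucket-name/path/to/file /home/hadoop/

The same s3 vs. s3n scheme caveat from the answer above applies here.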
answered 2013-08-07T09:33:23.963

I experienced this exact same error and finally hit on the solution. However, I changed two things at once, so I am not sure whether both are required (certainly at least one of them is).

First, I made sure my S3 data and my EMR cluster were in the same region. When I had this problem, my data was in US East and the EMR cluster was in US West. I standardized on US East (Virginia), a.k.a. us-east-1, a.k.a. US Standard, a.k.a. DEFAULT, a.k.a. N. Virginia. This may not have been required, but it did not hurt.

Second, when I got the error, I had started Pig by following the steps in one of the videos and gave it the "-x local" option. It turns out that "-x local" seems guaranteed to prevent access to S3 (see below).

The solution is to start Pig with no parameters.

I hope this helps.

Gil


hadoop@domU-12-31-39-09-24-66:~$ pig -x local
2013-07-03 00:27:15,321 [main] INFO  org.apache.pig.Main - Apache Pig version 0.11.1-amzn (rexported) compiled Jun 24 2013, 18:37:44
2013-07-03 00:27:15,321 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/hadoop/pig_1372811235317.log
2013-07-03 00:27:15,379 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/hadoop/.pigbootup not found
2013-07-03 00:27:15,793 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - 
Connecting to hadoop file system at: file:///

grunt>  ls s3://xxxxxx.xx.rawdata
2013-07-03 00:27:23,463 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. AWS Access Key ID and
Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId
or fs.s3.awsSecretAccessKey properties (respectively).
Details at logfile: /home/hadoop/pig_1372811235317.log

grunt> quit

hadoop@domU-12-31-39-09-24-66:~$ pig
2013-07-03 00:28:04,769 [main] INFO  org.apache.pig.Main - Apache Pig version 0.11.1-amzn (rexported) compiled Jun 24 2013, 18:37:44
2013-07-03 00:28:04,771 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/hadoop/pig_1372811284764.log
2013-07-03 00:28:04,873 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/hadoop/.pigbootup not found
2013-07-03 00:28:05,639 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - 
Connecting to hadoop file system at: hdfs://10.210.43.148:9000
2013-07-03 00:28:08,765 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 10.210.43.148:9001

grunt>  ls s3://xxxxxx.xx.rawdata
s3://xxxxxx.xx.rawdata/rawdata<r 1>  19813
s3://xxxxxx.xx.rawdata/rawdata.csv<r 1> 19813
grunt>
answered 2013-07-03T18:40:14.147