I am trying to run s3distcp to merge a lot of small (200-600 KB) files from S3 into HDFS.
I am running Hadoop on CDH 4.2 over Ubuntu.
To be specific: Hadoop 2.0.0-cdh4.2.0 Subversion file:///var/lib/jenkins/workspace/generic-package-ubuntu64-12-04/CDH4.2.0-Packaging-Hadoop-2013-02-15_10-38-54/hadoop-2.0.0+922-1.cdh4.2.0.p0.12~precise/src/hadoop-common-project/hadoop-common -r 8bce4bd28a464e0a92950c50ba01a9deb1d85686
I have already resolved the dependencies on aws-java-sdk-1.4.1.jar and s3distcp.jar by copying them into the Hadoop classpath. libsnappy1 is also installed.
But when I run:
hdfs@test-cdh-03-master:/home/ubuntu$ hadoop jar /usr/lib/hadoop/lib/s3distcp.jar --src 's3n://workdir-XXXX-YYYYlogs/production-YYYYYlogs/Log-FFFFFFF-click/' --dest 'hdfs:///test/' --groupBy 'Log-FFFFF(.*)'
I get the following error stack:
13/04/08 14:36:30 INFO s3distcp.S3DistCp: Using output path 'hdfs:/tmp/ab7c0a09-07ba-4592-b354-bcd0dd3d6a07/output'
13/04/08 14:36:36 INFO s3distcp.S3DistCp: Created 0 files to copy 0 files
13/04/08 14:36:36 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/04/08 14:36:37 INFO mapred.JobClient: Cleaning up the staging area hdfs://test-cdh-03-master.extc.test-cdh-03.adswizz.com/tmp/hadoop-temp/mapred/staging/hdfs/.staging/job_201304041515_0016
13/04/08 14:36:37 ERROR security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs:/tmp/ab7c0a09-07ba-4592-b354-bcd0dd3d6a07/files
13/04/08 14:36:37 INFO s3distcp.S3DistCp: Try to recursively delete hdfs:/tmp/ab7c0a09-07ba-4592-b354-bcd0dd3d6a07/tempspace
Exception in thread "main" java.lang.RuntimeException: Error running job
    at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:586)
    at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:216)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at com.amazon.external.elasticmapreduce.s3distcp.Main.main(Main.java:12)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs:/tmp/ab7c0a09-07ba-4592-b354-bcd0dd3d6a07/files
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:194)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:205)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1091)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1083)
    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:993)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:946)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:946)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:920)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1369)
    at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:568)
    ... 9 more
The "Created 0 files to copy 0 files" line suggests nothing in the source listing matched. Is there something else I should try? Is there a problem with the regex that I'm not seeing?
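For reference, here is a quick sanity check I can run outside of s3distcp to see whether the --groupBy expression matches my file names at all. The sample key below is made up to mirror the redacted names in the command above, and I'm not certain whether s3distcp applies the pattern to the full key path or only the file name, so this only tells me the expression matches somewhere in the name:

```shell
# Hypothetical sanity check: does the --groupBy regex match a sample key?
# (sample name is invented to mirror the redacted Log-FFFFFFF-click keys above)
sample='Log-FFFFFFF-click-2013-04-08.log'

# grep -E does substring matching with POSIX extended regex syntax, which is
# close enough to test whether 'Log-FFFFF(.*)' can match the name at all:
if echo "$sample" | grep -qE 'Log-FFFFF(.*)'; then
    echo "pattern matches"
else
    echo "pattern does not match"
fi
```

One caveat with this check: grep matches substrings, while a Java regex used with full-match semantics would need the pattern to cover the whole key, so a pattern that passes here could still select zero files in s3distcp.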