
I have many files in an HDFS directory whose names follow the format filename.yyyy-mm-dd-hh.machinename.gz. I need to transfer them to S3, but I want each one stored in my bucket under yyyy/mm/dd/hh/filename.yyyy-mm-dd-hh.machinename.gz (this would be the object name, since S3 has a flat namespace). The distcp command can transfer files from HDFS to S3, but is there an option to do the above? If not, how could I extend distcp to do it?
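Not an existing distcp option as far as I know, but one hedged workaround is to skip distcp entirely and issue one copy per file, deriving the S3 key from the file name. Everything below is a sketch: `my-bucket` and `/path/to/input` are placeholders, the sample file list stands in for the real `hdfs dfs -ls -C /path/to/input` output, and the `hadoop distcp` invocation is only echoed, not run.

```shell
# Hypothetical sketch: derive the S3 object key yyyy/mm/dd/hh/<name>
# from a name like filename.yyyy-mm-dd-hh.machinename.gz.
key_for() {
  name=$1
  stamp=$(echo "$name" | cut -d. -f2)         # extract yyyy-mm-dd-hh
  echo "$(echo "$stamp" | tr '-' '/')/$name"  # yyyy/mm/dd/hh/<name>
}

# One copy per file; in a real run, replace the sample list with
# the output of: hdfs dfs -ls -C /path/to/input
for f in filename.2016-12-10-08.machinename.gz; do
  echo hadoop distcp "hdfs:///path/to/input/$f" "s3a://my-bucket/$(key_for "$f")"
done
```

Per-file distcp launches are heavyweight; for many files it may be better to generate the full source/target list this way and feed it to a single job, but that depends on your Hadoop version.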


1 Answer


Note: this is not a complete solution, just a hint.

I don't know the exact answer, and I don't have an S3 bucket to try this on, but here is an awk-based way of pre-processing the file names and copying the files into the desired directory structure. The command below assumes a local Linux file system:

Initial directory content:

user@machine:~/path/to/input$ find
 ./filename.yyyy-mm-dd-hh.machinename.gz
 ./filename.2016-12-10-08.machinename.gz
 ./filename.2015-12-10-08.machinename.gz
 ./filename.2015-10-10-08.machinename.gz
 ./filename.2015-10-11-08.machinename.gz

Command for copying files inside a specific directory structure:

user@machine:~/path/to/input$ ls *.gz | awk -F'.' '{ split($2, d, "-"); PATH=d[1]"/"d[2]"/"d[3]"/"d[4]; system("mkdir -p "PATH); system("cp "$0" "PATH); }'
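The same pre-processing can also be written in plain shell with parameter expansion instead of awk. This is a hedged alternative, not part of the original answer; it demonstrates the idea on a throwaway temp directory with one sample file so it is safe to run anywhere.

```shell
# Same idea in plain shell: strip the name down to the yyyy-mm-dd-hh
# stamp, turn it into a path, and copy the file there.
dir=$(mktemp -d)
cd "$dir" || exit 1
touch filename.2016-12-10-08.machinename.gz   # sample input file

for f in *.gz; do
  stamp=${f#*.}                 # drop leading "filename." -> 2016-12-10-08.machinename.gz
  stamp=${stamp%%.*}            # keep only the timestamp   -> 2016-12-10-08
  path=$(echo "$stamp" | tr '-' '/')   # 2016/12/10/08
  mkdir -p "$path" && cp "$f" "$path/"
done

ls 2016/12/10/08/   # filename.2016-12-10-08.machinename.gz
```

The parameter-expansion version avoids spawning one awk process per field split and handles names without forking, though the awk one-liner is more compact for interactive use.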

Final directory content after command execution:

 ./filename.yyyy-mm-dd-hh.machinename.gz
 ./yyyy
 ./yyyy/mm
 ./yyyy/mm/dd
 ./yyyy/mm/dd/hh
 ./yyyy/mm/dd/hh/filename.yyyy-mm-dd-hh.machinename.gz

 ./filename.2016-12-10-08.machinename.gz
 ./2016
 ./2016/12
 ./2016/12/10
 ./2016/12/10/08
 ./2016/12/10/08/filename.2016-12-10-08.machinename.gz

 ./filename.2015-12-10-08.machinename.gz
 ./2015
 ./2015/12
 ./2015/12/10
 ./2015/12/10/08
 ./2015/12/10/08/filename.2015-12-10-08.machinename.gz

 ./filename.2015-10-11-08.machinename.gz
 ./2015/10
 ./2015/10/11
 ./2015/10/11/08
 ./2015/10/11/08/filename.2015-10-11-08.machinename.gz

 ./filename.2015-10-10-08.machinename.gz
 ./2015/10/10
 ./2015/10/10/08
 ./2015/10/10/08/filename.2015-10-10-08.machinename.gz
answered 2016-04-12T10:09:16.137