在 Amazon 数据管道中,我正在创建活动以使用 Hive 将 S3 复制到 EMR。为了实现它,我必须将两个输入参数作为一个步骤传递给 EMR 作业。我搜索了几乎所有数据管道文档,但没有找到指定多个输入参数的方法。我还与 AWS 支持团队进行了交谈,但他们也不清楚。他们建议的方式/技巧也不起作用。
下面是我的步骤参数和 Hive 查询。如果有人有实现它的想法,请告诉我。
脚步:
s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,s3://us-east-1.elasticmapreduce/libs/hive/hive-script,--base-path,s3://us-east-1.elasticmapreduce/libs/hive/,--hive-versions,latest,--run-hive-script,--args,-f,s3://gwbpipeline-test/scripts/multiple_user_sample_new.hql, -d, "output1=#{output.directoryPath}", -d,"input1=s3://gwbpipeline-test/temp/sb-test/#{format(@scheduledStartTime,'YYYY-MM-dd hh-mm-ss')}/input/tbl_users/", -d,"input2=s3://gwbpipeline-test/temp/sb-test/#{format(@scheduledStartTime,'YYYY-MM-dd hh-mm-ss')}/input/tbl_user_children/"
蜂巢查询:
drop table if exists tbl_users;
CREATE EXTERNAL TABLE tbl_users (
user_id string, user_first_name string, user_last_name string, user_email string, user_dob string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${input1}';
drop table if exists tbl_user_children;
CREATE EXTERNAL TABLE tbl_user_children (
id string, full_name string, birthday string, type string, user_id string, facebook_id string, date_added string
)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${input2}';
drop table if exists tbl_users_child_output;
CREATE EXTERNAL TABLE userS3output (
user_id string, user_fname string, user_lname string, child_full_name string, child_dirthdate string )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${output1}';
INSERT INTO TABLE tbl_users_child_output SELECT u.user_id, u.user_first_name, u.user_last_name, c.full_name, c.birthday FROM tbl_users as u join tbl_user_children as c ON u.user_id = c.user_id;