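I hope you can help me. I am trying to create an EMR cluster with Hadoop and Spark installed, using Data Pipeline. The problem is that this EMR cluster is private, so it has no internet access and cannot download anything. In the pipeline, I specify bootstrap actions that download all the .jar files and dependencies, including TaskRunner.jar.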

The pipeline's EMR activity launches script.py:

{
      "name": "DefaultEmrActivity1",
      "maximumRetries" : 0,
      "runsOn": {
        "ref": "EmrClusterId_lKm9y"
      },
      "id": "EmrActivityId_SRjHg",
      "type": "ShellCommandActivity",
      "command": "spark-submit --deploy-mode cluster --conf spark.yarn.submit.waitAppCompletion=true --py-files s3://emr/script.py"
    },

However, this step never runs on my EMR cluster. Instead, I see an "Install TaskRunner" step that tries to install the jars from the internet, so it fails.

The TaskRunner step command:

JAR location :s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar
Main class :None
Arguments :s3://datapipeline-eu-west-1/eu-west-1/bootstrap-actions/latest/TaskRunner/install-remote-runner-v2   
--workerGroup=df-08684532KKW88TTUXHVS_@EmrClusterId_lKm9y_2021-05-07T07:22:56   
--endpoint=https://datapipeline.eu-west-1.amazonaws.com --region=eu-west-1   
--logUri=s3://aws-logs-351516419540-eu-west-1/pipeline/df-08684532KKW88TTUXHVS/EmrClusterId_lKm9y/@EmrClusterId_lKm9y_2021-05-07T07:22:56/@EmrClusterId_lKm9y_2021-05-07T07:22:56_Attempt=1/ --taskRunnerId=54ec5b53-884b-420d-b3e6-d0e518ddf448   
--zipFile=http://datapipeline-eu-west-1.s3.amazonaws.com/eu-west-1/software/latest/TaskRunner/TaskRunner-1.0.zip   
--mysqlFile=http://datapipeline-eu-west-1.s3.amazonaws.com/eu-west-1/software/latest/TaskRunner/mysql-connector-java-bin.jar   
--hiveCsvSerdeFile=http://datapipeline-eu-west-1.s3.amazonaws.com/eu-west-1/software/latest/TaskRunner/csv-serde.jar   
--proxyHost= --proxyPort=-1 --username= --password= --windowsDomain= --windowsWorkgroup= --releaseLabel=emr-6.2.0   
--jdbcDriverS3Path=s3://datapipeline-eu-west-1/eu-west-1/software/latest/TaskRunner/ --s3NoProxy=false
Action on failure:Terminate cluster

Error:

Connecting to datapipeline-eu-west-1.s3.amazonaws.com (datapipeline-eu-west-1.s3.amazonaws.com)|52.218.108.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16873 (16K) [application/octet-stream]
Saving to: ‘common/csv-serde.jar’

     0K .......... ......                                     100% 26.7M=0.001s

2021-05-07 07:30:44 (26.7 MB/s) - ‘common/csv-serde.jar’ saved [16873/16873]

+ '[' -n emr-6.2.0 ']'
+ sudo echo -e '\nexport HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/mnt/taskRunner/common/mysql-connector-java-bin.jar:/etc/hadoop/hive/lib/hive-exec.jar"'
+ sudo tee -a /etc/hadoop/conf/hadoop-env.sh
+ bash /etc/hadoop/conf/hadoop-env.sh
+ '[' -z emr-6.2.0 ']'
+ unzip -o taskRunner.zip
+ chmod 500 aws-datapipeline-taskrunner-v2.sh
+ '[' -d /usr/share/aws/emr/goodies/lib ']'
+ '[' -n emr-6.2.0 ']'
+ EMR_HADOOP_GOODIES_NAME='emr-hadoop-goodies-*jar'
+ EMR_HIVE_GOODIES_NAME='emr-hive-goodies-*jar'
+ OPEN_CSV_PATH=/usr/lib/hive/lib/
++ find /usr/share/aws/emr/goodies/lib -name 'emr-hadoop-goodies-*jar'
+ emr_goodies_jar=/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies-3.1.0.jar
+ '[' -n /usr/share/aws/emr/goodies/lib/emr-hadoop-goodies-3.1.0.jar ']'
+ open_csv_symlink=/mnt/taskRunner/open-csv.jar
+ emr_goodies_symlink=/mnt/taskRunner/oncluster-emr-hadoop-goodies.jar
+ emr_hive_goodies_symlink=/mnt/taskRunner/oncluster-emr-hive-goodies.jar
+ sudo rm -f /mnt/taskRunner/open-csv.jar
+ sudo rm -f /mnt/taskRunner/oncluster-emr-hadoop-goodies.jar
+ sudo rm -f /mnt/taskRunner/oncluster-emr-hive-goodies.jar
++ find /usr/share/aws/emr/goodies/lib -name 'emr-hive-goodies-*jar'
+ emr_hive_jar=/usr/share/aws/emr/goodies/lib/emr-hive-goodies-3.1.0.jar
++ find /usr/lib/hive/lib/ -name 'opencsv*jar'
+ open_csv_jar='/usr/lib/hive/lib/opencsv-2.3.jar
/usr/lib/hive/lib/opencsv-3.9.jar'
+ sudo ln -s /usr/share/aws/emr/goodies/lib/emr-hadoop-goodies-3.1.0.jar /mnt/taskRunner/oncluster-emr-hadoop-goodies.jar
+ sudo ln -s /usr/share/aws/emr/goodies/lib/emr-hive-goodies-3.1.0.jar /mnt/taskRunner/oncluster-emr-hive-goodies.jar
+ sudo ln -s /usr/lib/hive/lib/opencsv-2.3.jar /usr/lib/hive/lib/opencsv-3.9.jar /mnt/taskRunner/open-csv.jar
ln: target ‘/mnt/taskRunner/open-csv.jar’ is not a directory
Command exiting with ret '1'

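I don't know why the link cannot be created, because the EMR cluster terminates when the step fails, so I cannot inspect it.
In any case, I don't want this step to run, since these jars are already installed by the bootstrap actions. Any suggestions on how to prevent this step from running? Thanks.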

2 Answers

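When you create a Data Pipeline with an EmrCluster resource, it launches a cluster with a predefined configuration and automatically runs a step that installs and runs Task Runner (reference).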

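I ran into the same error in the step that installs Task Runner. Instead, you can create an EMR cluster first, install and run Task Runner on it yourself, and then, when creating the data pipeline, associate the cluster with the pipeline through the workerGroup field of the EmrActivity. This worked for me. An answer explaining how to do this is available here. The documentation is available here.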
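
As a rough illustration of that approach, the activity from the question could be routed by workerGroup instead of runsOn. This is only a sketch: the worker group name "my-emr-worker-group" and the helper function are hypothetical, and the other fields are copied from the question's pipeline definition.

```python
import json

# Hypothetical worker group name. It has to match the --workerGroup value
# given to the Task Runner that was started by hand on the existing cluster.
WORKER_GROUP = "my-emr-worker-group"


def activity_for_worker_group(command: str, worker_group: str) -> dict:
    """Build the activity object with a workerGroup field and no runsOn reference."""
    return {
        "id": "EmrActivityId_SRjHg",
        "name": "DefaultEmrActivity1",
        "type": "ShellCommandActivity",
        "maximumRetries": 0,
        # workerGroup replaces the "runsOn" reference to the EmrCluster resource,
        # so Data Pipeline has no cluster to launch for this activity.
        "workerGroup": worker_group,
        "command": command,
    }


activity = activity_for_worker_group(
    "spark-submit --deploy-mode cluster "
    "--conf spark.yarn.submit.waitAppCompletion=true "
    "--py-files s3://emr/script.py",
    WORKER_GROUP,
)
print(json.dumps(activity, indent=2))
```

With no runsOn reference, the pipeline does not need an EmrCluster resource for this activity, so the automatic "Install TaskRunner" step should not come into play. The task is picked up by whichever Task Runner polls for the same worker group.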

Answered 2022-01-07T12:04:43.027

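If you look at the open_csv_jar variable (open_csv_jar='/usr/lib/hive/lib/opencsv-2.3.jar /usr/lib/hive/lib/opencsv-3.9.jar'), you will see that it holds two versions. As a result, ln -s receives two source paths and expects its last argument to be a directory, which is why the link fails. I don't know why there are two versions, but if you try emr 0.6.1.0 (presumably release emr-6.1.0) it does not happen, and the cluster provisions without problems.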
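
The failure can be reproduced outside EMR. The sketch below assumes a POSIX system with the ln command available. The temporary files are empty stand-ins for the two opencsv jars, and the helper try_symlink is hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path


def try_symlink(sources, link_path):
    """Run `ln -s SOURCES... LINK` and return (exit code, stderr text)."""
    result = subprocess.run(
        ["ln", "-s", *map(str, sources), str(link_path)],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stderr.strip()


with tempfile.TemporaryDirectory() as tmp_name:
    tmp = Path(tmp_name)
    # Empty stand-ins for the two jars that `find` returned on the cluster.
    jars = [tmp / "opencsv-2.3.jar", tmp / "opencsv-3.9.jar"]
    for jar in jars:
        jar.touch()
    link = tmp / "open-csv.jar"

    # Two sources: ln requires the last argument to be an existing directory.
    rc_two, err_two = try_symlink(jars, link)

    # One source: ln creates an ordinary symlink at the target path.
    rc_one, err_one = try_symlink(jars[-1:], link)
    link_created = link.is_symlink()

print("two sources -> exit code", rc_two, "|", err_two)
print("one source  -> exit code", rc_one, "| link created:", link_created)
```

The first call fails in the same way as the step log, while the second succeeds. This suggests the install script only works when find returns exactly one opencsv jar.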

Answered 2021-05-10T21:38:04.903