I've seen examples of people writing EMR output to HDFS, but I can't find an example of how it's actually done. On top of that, the documentation seems to say that the --output argument of an EMR streaming job must be an S3 bucket.
When I actually try to run a script (in this case, with Python streaming and mrjob), it raises an "Invalid S3 URI" error.
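For completeness, the mrjob.conf I pass via --conf-path is just an ordinary EMR runner config; a minimal stand-in looks roughly like the following (every value here is a placeholder, not my real config):

    runners:
      emr:
        aws_access_key_id: YOUR_KEY_ID
        aws_secret_access_key: YOUR_SECRET_KEY
        aws_region: us-east-1
        ec2_key_pair: EMR
        ec2_key_pair_file: ~/.ssh/EMR.pem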
Here's the command:
    python my_script.py -r emr \
        --emr-job-flow-id=j-JOBID --conf-path=./mrjob.conf --no-output \
        --output hdfs:///my-output \
        hdfs:///my-input-directory/my-files*.gz
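The script itself is nothing exotic; it's a standard MRJob subclass along the lines of this simplified stand-in (the real job's logic is omitted; the mapper and reducer here are placeholders):

    from mrjob.job import MRJob

    class SamplerJob(MRJob):
        """Placeholder job: counts occurrences of each line's first field."""

        def mapper(self, _, line):
            yield line.split()[0], 1

        def reducer(self, key, counts):
            yield key, sum(counts)

    if __name__ == '__main__':
        SamplerJob.run()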
And the traceback...
    Traceback (most recent call last):
      File "pipes/sampler.py", line 28, in <module>
        SamplerJob.run()
      File "/Library/Python/2.7/site-packages/mrjob/job.py", line 483, in run
        mr_job.execute()
      File "/Library/Python/2.7/site-packages/mrjob/job.py", line 501, in execute
        super(MRJob, self).execute()
      File "/Library/Python/2.7/site-packages/mrjob/launch.py", line 146, in execute
        self.run_job()
      File "/Library/Python/2.7/site-packages/mrjob/launch.py", line 206, in run_job
        with self.make_runner() as runner:
      File "/Library/Python/2.7/site-packages/mrjob/job.py", line 524, in make_runner
        return super(MRJob, self).make_runner()
      File "/Library/Python/2.7/site-packages/mrjob/launch.py", line 161, in make_runner
        return EMRJobRunner(**self.emr_job_runner_kwargs())
      File "/Library/Python/2.7/site-packages/mrjob/emr.py", line 585, in __init__
        self._output_dir = self._check_and_fix_s3_dir(self._output_dir)
      File "/Library/Python/2.7/site-packages/mrjob/emr.py", line 776, in _check_and_fix_s3_dir
        raise ValueError('Invalid S3 URI: %r' % s3_uri)
    ValueError: Invalid S3 URI: 'hdfs:///input/sample'
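Looking at the frame where it dies, the check is a plain URI-scheme test; mrjob's _check_and_fix_s3_dir boils down to something like this (my paraphrase of the source, not a verbatim copy):

    from mrjob.parse import is_s3_uri  # True only for s3:// and s3n:// URIs

    def _check_and_fix_s3_dir(self, s3_uri):
        # Reject anything that is not an S3 URI, which is why an
        # hdfs:/// path never makes it past EMRJobRunner.__init__().
        if not is_s3_uri(s3_uri):
            raise ValueError('Invalid S3 URI: %r' % s3_uri)
        # Normalize to a trailing slash so it is treated as a directory.
        if not s3_uri.endswith('/'):
            s3_uri = s3_uri + '/'
        return s3_uri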
How do I get an EMR streaming job to write its output to HDFS? Is it even possible?
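For what it's worth, the only fallback I can see is pointing --output at S3 and then copying the results onto the cluster's HDFS afterwards with distcp, roughly like this (the bucket and paths are placeholders):

    # run on the master node once the job finishes
    hadoop distcp s3n://my-bucket/my-output hdfs:///my-output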