I'm using MrJob to run a Hadoop job on Elastic MapReduce, and the job keeps crashing at random.
The data looks like this (tab-separated):
279391888 261151291 107.303163 35.468534
279391888 261115099 108.511726 35.503008
279391888 261151290 104.881560 35.278487
279391888 261151292 109.732004 35.659141
279391888 261266862 108.507754 35.434581
279391888 1687590146 59.118796 19.931201
279391888 269450882 58.909985 19.914108
And the underlying MapReduce is very simple:
from mrjob.job import MRJob
import numpy as np

class CitypathsSummarize(MRJob):

    def mapper(self, _, line):
        orig, dest, minutes, dist = line.split()
        minutes = float(minutes)
        dist = float(dist)
        if minutes < .001:
            yield "crap", 1
        else:
            yield orig, dist/minutes

    def reducer(self, orig, speeds):
        speeds = list(speeds)
        mean = np.mean(speeds)
        yield orig, mean

if __name__ == "__main__":
    CitypathsSummarize.run()
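For what it's worth, the mapper/reducer logic is simple enough to sanity-check without a cluster. Here's a standalone sketch (my own test harness, not part of the job) that simulates the shuffle by grouping mapper output by key on the first three sample rows, using a plain-Python mean in place of numpy:

```python
from collections import defaultdict

# First three sample rows from the question, tab-separated:
# orig, dest, minutes, dist
sample = """\
279391888\t261151291\t107.303163\t35.468534
279391888\t261115099\t108.511726\t35.503008
279391888\t261151290\t104.881560\t35.278487
"""

def mapper(line):
    # Same logic as the mrjob mapper above.
    orig, dest, minutes, dist = line.split()
    minutes = float(minutes)
    dist = float(dist)
    if minutes < .001:
        yield "crap", 1
    else:
        yield orig, dist / minutes

# Simulate the shuffle: group mapper output by key.
groups = defaultdict(list)
for line in sample.splitlines():
    for key, value in mapper(line):
        groups[key].append(value)

# Reduce each group to its mean speed.
for orig, speeds in groups.items():
    print(orig, sum(speeds) / len(speeds))
```

Run locally, this confirms the job itself is computing a per-origin mean speed and that the sample rows parse cleanly, so the crash is not a parsing bug in the mapper.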
When I run it, I use the following command, with the default mrjob.conf (my keys are set in the environment):
$ python summarize.py -r emr --ec2-instance-type c1.xlarge --num-ec2-instances 4 s3://citypaths/chicago-v4/ > chicago-v4-output.txt
When I run it on a small dataset, it finishes fine. When I run it on the whole corpus (about 10 GiB worth), I get an error like this (though not at the same point each time!):
Probable cause of failure (from s3://mrjob-093c9ef589d9f262/tmp/logs/j-KCPTKZR5OX6D/task-attempts/attempt_201301211911_0001_m_000151_3/syslog):
java.io.FileNotFoundException: /mnt2/var/lib/hadoop/mapred/taskTracker/jobcache/job_201301211911_0001/attempt_201301211911_0001_m_000018_4/output/spill0.out
(while reading from s3://citypaths/chicago-v4/1613640660)
Terminating job flow: j-KCPTKZR5OX6D
Traceback (most recent call last):
File "summarize.py", line 32, in <module>
CitypathsSummarize.run()
File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 545, in run
mr_job.execute()
File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 561, in execute
self.run_job()
File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 631, in run_job
runner.run()
File "/usr/local/lib/python2.7/dist-packages/mrjob/runner.py", line 490, in run
self._run()
File "/usr/local/lib/python2.7/dist-packages/mrjob/emr.py", line 1048, in _run
self._wait_for_job_to_complete()
File "/usr/local/lib/python2.7/dist-packages/mrjob/emr.py", line 1830, in _wait_for_job_to_complete
raise Exception(msg)
Exception: Job on job flow j-KCPTKZR5OX6D failed with status SHUTTING_DOWN: Shut down as step failed
Probable cause of failure (from s3://mrjob-093c9ef589d9f262/tmp/logs/j-KCPTKZR5OX6D/task-attempts/attempt_201301211911_0001_m_000151_3/syslog):
java.io.FileNotFoundException: /mnt2/var/lib/hadoop/mapred/taskTracker/jobcache/job_201301211911_0001/attempt_201301211911_0001_m_000018_4/output/spill0.out
(while reading from s3://citypaths/chicago-v4/1613640660)
I've run it twice now; the first time it died after 45 minutes, and this time after 4 hours. It died on a different file each time. I checked both files it died on, and neither has anything wrong with it.
Somehow it can't find a spill file that it wrote itself, which baffles me.
EDIT:
I ran the job again, and after a few hours it died once more, this time with a different error message.
Probable cause of failure (from s3://mrjob-093c9ef589d9f262/tmp/logs/j-3GGW2TSIKKW5R/task-attempts/attempt_201301310511_0001_m_001810_0/syslog):
Status Code: 403, AWS Request ID: 9E9E748A55BC6A58, AWS Error Code: RequestTimeTooSkewed, AWS Error Message: The difference between the request time and the current time is too large., S3 Extended Request ID: Ky+HVYZ8RsC3l5f9N3LTwyorY9bbqEnc4tRD/r/xfAHYP/eiQrjjcpmIDNY2eoDo
(while reading from s3://citypaths/chicago-v4/1439606131)