
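Lately, every Hadoop job I run shows a delay of 3 minutes 10 seconds on one particular map node (the master, which also works as a slave). After that initial delay the task behaves normally and finishes almost immediately.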

For example, when running the QuasiMonteCarlo example:

Task Id                                 Start Time      Finish Time
attempt_201204101957_0006_m_000003_0    10/04 20:14:54  10/04 20:18:05 (3mins, 10sec)   /default-rack/master

2012-04-10 20:18:04,470 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2012-04-10 20:18:04,646 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2012-04-10 20:18:04,647 WARN org.apache.hadoop.conf.Configuration: user.name is deprecated. Instead, use mapreduce.job.user.name
2012-04-10 20:18:04,751 INFO org.apache.hadoop.mapreduce.util.ProcessTree: setsid exited with exit code 0
2012-04-10 20:18:04,754 INFO org.apache.hadoop.mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.mapreduce.util.LinuxResourceCalculatorPlugin@79ee2c2c
2012-04-10 20:18:04,912 INFO org.apache.hadoop.mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
2012-04-10 20:18:04,912 INFO org.apache.hadoop.mapred.MapTask: mapreduce.task.io.sort.mb: 100
2012-04-10 20:18:04,912 INFO org.apache.hadoop.mapred.MapTask: soft limit at 83886080
2012-04-10 20:18:04,912 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0; bufvoid = 104857600
2012-04-10 20:18:04,912 INFO org.apache.hadoop.mapred.MapTask: kvstart = 26214396; length = 6553600
2012-04-10 20:18:04,939 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
2012-04-10 20:18:04,940 INFO org.apache.hadoop.mapred.MapTask: Spilling map output
2012-04-10 20:18:04,940 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 18; bufvoid = 104857600
2012-04-10 20:18:04,940 INFO org.apache.hadoop.mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214392(104857568); length = 5/6553600
2012-04-10 20:18:04,972 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0
2012-04-10 20:18:04,975 INFO org.apache.hadoop.mapred.Task: Task:attempt_201204101957_0006_m_000003_0 is done. And is in the process of commiting
2012-04-10 20:18:05,058 INFO org.apache.hadoop.mapred.Task: Task 'attempt_201204101957_0006_m_000003_0' done.

The TaskTracker log is more telling:

2012-04-10 **20:14:54,615** INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 1 and trying to launch attempt_201204101957_0006_m_000003_0 which needs 1 slots
2012-04-10 20:14:54,685 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_201204101957_0006_m_377512887 spawned.
2012-04-10 20:16:34,041 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 1
2012-04-10 **20:18:04,433** INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201204101957_0006_m_377512887 given task: attempt_201204101957_0006_m_000003_0
2012-04-10 20:18:04,938 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204101957_0006_m_000003_0 0.0%
2012-04-10 20:18:05,056 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201204101957_0006_m_000003_0 0.667% Generated 1000 samples.

2012-04-10 20:18:05,058 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201204101957_0006_m_000003_0 is done.
2012-04-10 20:18:05,058 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201204101957_0006_m_000003_0 was 28
2012-04-10 20:18:05,058 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 2
2012-04-10 20:18:05,213 INFO org.apache.hadoop.mapreduce.util.ProcessTree: Sending signal to all members of process group -23030: SIGTERM. Exit code 1
2012-04-10 20:18:08,478 INFO org.apache.hadoop.mapred.TaskTracker: Sent out 28 bytes to reduce 0 from map: attempt_201204101957_0006_m_000003_0 given 28/24
2012-04-10 20:18:08,478 INFO org.apache.hadoop.mapred.TaskTracker: Shuffled 1maps (mapIds=attempt_201204101957_0006_m_000003_0) to reduce 0 in 29s
2012-04-10 20:18:08,478 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 147.102.7.173:50060, dest: 147.102.7.175:57289, maps: 1, op: MAPRED_SHUFFLE, reduceID: 0, duration: 29
2012-04-10 20:18:10,217 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201204101957_0006_m_377512887 exited with exit code 0. Number of tasks it ran: 1
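
To put a number on the gap in the TaskTracker log, the timestamps can be subtracted directly. This is only a rough sketch in Python, not part of Hadoop: the helper names `parse_log_time` and `elapsed_seconds` are made up for illustration, and the two sample lines are shortened copies of the "trying to launch" and "given task" entries above.

```python
from datetime import datetime

# Timestamp layout of the log lines, e.g. "2012-04-10 20:14:54,615".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TIMESTAMP_LENGTH = 23  # characters in "YYYY-MM-DD HH:MM:SS,mmm"


def parse_log_time(line):
    """Return the leading timestamp of a log line as a datetime."""
    # Drop the ** emphasis markers used in the post before slicing.
    cleaned = line.replace("*", "").strip()
    return datetime.strptime(cleaned[:TIMESTAMP_LENGTH], LOG_TIME_FORMAT)


def elapsed_seconds(earlier_line, later_line):
    """Seconds between the timestamps of two log lines."""
    delta = parse_log_time(later_line) - parse_log_time(earlier_line)
    return delta.total_seconds()


# Shortened copies of two TaskTracker entries (message text truncated).
launch = "2012-04-10 20:14:54,615 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher"
given = "2012-04-10 20:18:04,433 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID"

if __name__ == "__main__":
    print(f"{elapsed_seconds(launch, given):.3f} s between launch and task hand-off")
```

For these two entries the difference comes out at 189.818 seconds, which accounts for almost all of the 3 minutes 10 seconds reported for the attempt; the task itself runs for well under a second once it has been handed to the JVM.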

I suspected a network problem, but ping and ssh both work without any problems.
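
Since ping and ssh only exercise ICMP and port 22, it may be worth timing name resolution and a plain TCP connect on another port as well. The following is a minimal sketch, assuming Python is available on the nodes; `time_connect` is a hypothetical helper and not anything Hadoop provides. Whether the delay is actually related to name resolution or to a particular port is an assumption, not something the logs confirm.

```python
import socket
import time


def time_connect(host, port, timeout=300.0):
    """Return (resolve_seconds, connect_seconds) for one TCP connection attempt."""
    start = time.monotonic()
    # Name resolution is timed separately from the connect itself.
    family, socktype, proto, _, address = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    resolved = time.monotonic()
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        sock.connect(address)  # raises OSError on refusal or timeout
    connected = time.monotonic()
    return resolved - start, connected - resolved
```

For instance, `time_connect("master", 50060)` run from each node would use the host name from the task listing and the port from the clienttrace line above. A call that takes minutes rather than milliseconds in either phase would point towards the network or name-resolution setup.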
