我的 hadoop 工作卡在最后一个 reduce 任务上。我已经看到很多从 3 个不同主机到单个主机的连接超时。但是,我可以从任何其他机器上 ping 这台麻烦的机器。
这是一个 5 节点集群。它是最近建成的。它们具有相同的 hadoop 和 pig 二进制文件。它有3台新机器和2台旧机器。如果我删除了 2 台旧机器,它工作正常。
有问题的旧机器版本:
java version "1.6.0_20"
OpenJDK Runtime Environment (IcedTea6 1.9.13) (6b20-1.9.13-0ubuntu1~10.10.1)
OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)
新工作机版本:
java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.3) (6b24-1.11.3-1ubuntu0.12.04.1)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
我还从日志中看到了大量的 dup 主机(180 个)。
2012-08-28 12:34:53,307 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:34:53,307 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)
2012-08-28 12:35:53,308 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:35:53,308 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)
2012-08-28 12:36:53,309 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:36:53,309 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)
2012-08-28 12:37:18,416 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 copy failed: attempt_201208231557_0021_m_000002_0 from Han
2012-08-28 12:37:18,417 WARN org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:327)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:193)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:180)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
at java.net.Socket.connect(Socket.java:546)
at sun.net.NetworkClient.doConnect(NetworkClient.java:173)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:409)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:530)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:240)
at sun.net.www.http.HttpClient.New(HttpClient.java:321)
at sun.net.www.http.HttpClient.New(HttpClient.java:338)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:935)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:876)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:801)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1618)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1575)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1483)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1394)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1326)
2012-08-28 12:37:18,417 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201208231557_0021_r_000008_0: Failed fetch #6 from attempt_201208231557_0021_m_000002_0
2012-08-28 12:37:18,417 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 adding host Han to penalty box, next contact in 48 seconds
2012-08-28 12:37:18,417 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0: Got 1 map-outputs from previous failures
2012-08-28 12:37:53,418 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 0 is already in progress
2012-08-28 12:37:53,418 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (1 slow hosts and0 dup hosts)
2012-08-28 12:37:53,418 INFO org.apache.hadoop.mapred.ReduceTask: Penalized(slow) Hosts:
2012-08-28 12:37:53,418 INFO org.apache.hadoop.mapred.ReduceTask: Han Will be considered after: 13 seconds.
2012-08-28 12:38:08,418 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2012-08-28 12:38:53,419 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:38:53,419 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)
2012-08-28 12:39:53,420 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:39:53,420 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)
2012-08-28 12:40:53,421 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:40:53,421 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)
2012-08-28 12:41:08,537 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 copy failed: attempt_201208231557_0021_m_000002_0 from Han
2012-08-28 12:41:08,538 WARN org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:327)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:193)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:180)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
at java.net.Socket.connect(Socket.java:546)
at sun.net.NetworkClient.doConnect(NetworkClient.java:173)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:409)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:530)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:240)
at sun.net.www.http.HttpClient.New(HttpClient.java:321)
at sun.net.www.http.HttpClient.New(HttpClient.java:338)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:935)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:876)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:801)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1618)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1575)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1483)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1394)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1326)
2012-08-28 12:41:08,538 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201208231557_0021_r_000008_0: Failed fetch #7 from attempt_201208231557_0021_m_000002_0
2012-08-28 12:41:08,538 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 adding host Han to penalty box, next contact in 62 seconds
2012-08-28 12:41:08,538 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0: Got 1 map-outputs from previous failures
2012-08-28 12:41:53,539 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 0 is already in progress
2012-08-28 12:41:53,539 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (1 slow hosts and0 dup hosts)
2012-08-28 12:41:53,539 INFO org.apache.hadoop.mapred.ReduceTask: Penalized(slow) Hosts:
2012-08-28 12:41:53,539 INFO org.apache.hadoop.mapred.ReduceTask: Han Will be considered after: 17 seconds.
2012-08-28 12:42:13,540 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2012-08-28 12:42:53,540 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:42:53,540 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)
更多日志:
2012-09-06 11:01:47,719 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201209041720_0010_r_000014_0 1.0% reduce > reduce
2012-09-06 11:01:47,720 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201209041720_0010_r_000014_0 is done.
2012-09-06 11:01:47,720 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201209041720_0010_r_000014_0 was -1
2012-09-06 11:01:47,720 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 6
2012-09-06 11:01:47,769 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201209041720_0010_r_1617080101 exited with exit code 0. Number of tasks it ran: 1
2012-09-06 11:01:47,913 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201209041720_0010_r_-434797671 exited with exit code 0. Number of tasks it ran: 1
2012-09-06 11:01:50,428 INFO org.apache.hadoop.mapred.TaskTracker: Received KillTaskAction for task: attempt_201209041720_0010_r_000014_0
2012-09-06 11:01:50,428 INFO org.apache.hadoop.mapred.TaskTracker: About to purge task: attempt_201209041720_0010_r_000014_0
2012-09-06 11:01:50,429 INFO org.apache.hadoop.mapred.TaskTracker: Received KillTaskAction for task: attempt_201209041720_0010_r_000020_0
2012-09-06 11:01:50,429 INFO org.apache.hadoop.mapred.TaskTracker: About to purge task: attempt_201209041720_0010_r_000020_0