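We are having a problem with our Hadoop cluster. The JobTracker has silently dropped 10 nodes (out of about 70) from its UI.
Although these nodes used to run jobs, the JobTracker no longer lists them at all: not under Nodes, Blacklisted Nodes, Graylisted Nodes, or Excluded Nodes.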
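The TaskTracker processes are still running on the hosts. I checked that there is network connectivity, and that ssh works, between the JobTracker and the missing nodes.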
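In the logs, I see that we recently started getting a lot of failures like this one: https://issues.apache.org/jira/browse/MAPREDUCE-5. There is a correlation between the Jetty /mapOutput errors and the TaskTrackers stopping.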
Does anyone know what would cause a TaskTracker to fail silently instead of being put on the Blacklisted Nodes list?
I have dumped the TaskTracker threads with jstack.
It looks like the TaskTracker is trying to shut down but is waiting on something.
Deadlock Detection:
No deadlocks found.
Thread 14005: (state = BLOCKED)
- java.lang.Thread.sleep(long) @bci=0 (Compiled frame; information may be imprecise)
- org.apache.hadoop.ipc.Client.stop() @bci=105, line=973 (Compiled frame)
- org.apache.hadoop.ipc.RPC$ClientCache.stopClient(org.apache.hadoop.ipc.Client) @bci=47, line=191 (Interpreted frame)
- org.apache.hadoop.ipc.RPC$ClientCache.access$500(org.apache.hadoop.ipc.RPC$ClientCache, org.apache.hadoop.ipc.Client) @bci=2, line=140 (Interpreted frame)
- org.apache.hadoop.ipc.RPC$Invoker.close() @bci=19, line=238 (Interpreted frame)
- org.apache.hadoop.ipc.RPC$Invoker.access$600(org.apache.hadoop.ipc.RPC$Invoker) @bci=1, line=203 (Interpreted frame)
- org.apache.hadoop.ipc.RPC.stopProxy(org.apache.hadoop.ipc.VersionedProtocol) @bci=11, line=439 (Interpreted frame)
- org.apache.hadoop.hdfs.DFSClient.close() @bci=34, line=283 (Interpreted frame)
- org.apache.hadoop.hdfs.DistributedFileSystem.close() @bci=8, line=328 (Interpreted frame)
- org.apache.hadoop.fs.FileSystem$Cache.closeAll() @bci=78, line=1446 (Interpreted frame)
- org.apache.hadoop.fs.FileSystem.closeAll() @bci=40, line=277 (Interpreted frame)
- org.apache.hadoop.fs.FileSystem$ClientFinalizer.run() @bci=0, line=260 (Interpreted frame)
Thread 18731: (state = BLOCKED)
- sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
- java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=156 (Compiled frame)
- java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=1987 (Compiled frame)
- java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=399 (Compiled frame)
- org.apache.hadoop.mapred.TaskTracker$1.run() @bci=7, line=434 (Compiled frame)
- java.lang.Thread.run() @bci=11, line=662 (Interpreted frame)
Thread 18730: (state = BLOCKED)
- sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
- java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=156 (Compiled frame)
- java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=1987 (Compiled frame)
- java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=399 (Compiled frame)
- org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager.monitor() @bci=4, line=131 (Interpreted frame)
- org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager$1.run() @bci=4, line=66 (Compiled frame)
Thread 18729: (state = BLOCKED)
- java.lang.Thread.sleep(long) @bci=0 (Compiled frame; information may be imprecise)
- org.apache.hadoop.mapred.UserLogCleaner.run() @bci=4, line=93 (Interpreted frame)
Thread 18728: (state = BLOCKED)
- java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
- java.util.TimerThread.mainLoop() @bci=201, line=509 (Compiled frame)
- java.util.TimerThread.run() @bci=1, line=462 (Interpreted frame)
Thread 18724: (state = IN_NATIVE)
- sun.nio.ch.EPollArrayWrapper.epollWait(long, int, long, int) @bci=0 (Compiled frame; information may be imprecise)
- sun.nio.ch.EPollArrayWrapper.poll(long) @bci=18, line=210 (Compiled frame)
- sun.nio.ch.EPollSelectorImpl.doSelect(long) @bci=28, line=65 (Compiled frame)
- sun.nio.ch.SelectorImpl.lockAndDoSelect(long) @bci=37, line=69 (Compiled frame)
- sun.nio.ch.SelectorImpl.select(long) @bci=30, line=80 (Compiled frame)
- org.mortbay.io.nio.SelectorManager$SelectSet.doSelect() @bci=615, line=457 (Compiled frame)
- org.mortbay.io.nio.SelectorManager.doSelect(int) @bci=24, line=190 (Compiled frame)
- org.mortbay.jetty.nio.SelectChannelConnector.accept(int) @bci=5, line=124 (Compiled frame)
- org.mortbay.jetty.AbstractConnector$Acceptor.run() @bci=151, line=706 (Compiled frame)
- org.mortbay.thread.QueuedThreadPool$PoolThread.run() @bci=25, line=520 (Interpreted frame)
Thread 18671: (state = IN_NATIVE)
- sun.nio.ch.EPollArrayWrapper.epollWait(long, int, long, int) @bci=0 (Compiled frame; information may be imprecise)
- sun.nio.ch.EPollArrayWrapper.poll(long) @bci=18, line=210 (Compiled frame)
- sun.nio.ch.EPollSelectorImpl.doSelect(long) @bci=28, line=65 (Compiled frame)
- sun.nio.ch.SelectorImpl.lockAndDoSelect(long) @bci=37, line=69 (Compiled frame)
- sun.nio.ch.SelectorImpl.select(long) @bci=30, line=80 (Compiled frame)
- sun.nio.ch.SelectorImpl.select() @bci=2, line=84 (Compiled frame)
- org.apache.hadoop.ipc.Server$Listener$Reader.run() @bci=33, line=333 (Compiled frame)
- java.util.concurrent.ThreadPoolExecutor$Worker.runTask(java.lang.Runnable) @bci=59, line=886 (Interpreted frame)
- java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=28, line=908 (Interpreted frame)
- java.lang.Thread.run() @bci=11, line=662 (Interpreted frame)
Thread 18667: (state = BLOCKED)
- sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
- java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=156 (Compiled frame)
- java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=1987 (Compiled frame)
- java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=399 (Compiled frame)
- org.apache.hadoop.mapred.CleanupQueue$PathCleanupThread.run() @bci=47, line=130 (Compiled frame)
Thread 18661: (state = BLOCKED)
Thread 18660: (state = BLOCKED)
- java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
- java.lang.ref.ReferenceQueue.remove(long) @bci=44, line=118 (Compiled frame)
- java.lang.ref.ReferenceQueue.remove() @bci=2, line=134 (Compiled frame)
- java.lang.ref.Finalizer$FinalizerThread.run() @bci=3, line=159 (Compiled frame)
Thread 18659: (state = BLOCKED)
- java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
- java.lang.Object.wait() @bci=2, line=485 (Compiled frame)
- java.lang.ref.Reference$ReferenceHandler.run() @bci=46, line=116 (Compiled frame)
Thread 18644: (state = BLOCKED)
- java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
- java.lang.Thread.join(long) @bci=38, line=1186 (Compiled frame)
- java.lang.Thread.join() @bci=2, line=1239 (Interpreted frame)
- java.lang.ApplicationShutdownHooks.runHooks() @bci=87, line=79 (Interpreted frame)
- java.lang.ApplicationShutdownHooks$1.run() @bci=0, line=24 (Interpreted frame)
- java.lang.Shutdown.runHooks() @bci=23, line=79 (Interpreted frame)
- java.lang.Shutdown.sequence() @bci=26, line=123 (Interpreted frame)
- java.lang.Shutdown.exit(int) @bci=96, line=168 (Interpreted frame)
- java.lang.Runtime.exit(int) @bci=14, line=90 (Interpreted frame)
- java.lang.System.exit(int) @bci=4, line=904 (Interpreted frame)
- org.apache.hadoop.mapred.TaskTracker.main(java.lang.String[]) @bci=114, line=3722 (Interpreted frame)
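
Reading the dump, thread 18644 (TaskTracker.main) is inside System.exit and is joined on the shutdown hooks, while thread 14005 (the FileSystem$ClientFinalizer hook) is sleeping inside org.apache.hadoop.ipc.Client.stop(). That would explain a process that is still alive but no longer reporting to the JobTracker. The sketch below is only an illustration of that pattern and is not Hadoop code: the class ShutdownHookSketch, the openConnections counter, and both helper methods are made up, and it assumes that Client.stop() polls until its connections are closed, which the dump suggests but does not prove.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Standalone sketch, not Hadoop code: a "hook" thread polls until a counter
 * of open connections reaches zero, while the caller joins on that thread.
 */
public class ShutdownHookSketch {
    // Hypothetical stand-in for whatever the real hook is waiting on.
    public static final AtomicInteger openConnections = new AtomicInteger(1);

    /** Builds a thread that sleeps in a loop until no connections remain. */
    public static Thread newHook() {
        Thread hook = new Thread(() -> {
            while (openConnections.get() > 0) {
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    return; // give up if interrupted
                }
            }
        }, "hook");
        hook.setDaemon(true); // lets this demo JVM exit even if the hook never finishes
        return hook;
    }

    /** Joins the thread for at most timeoutMs and reports whether it is still alive. */
    public static boolean stillAliveAfter(Thread t, long timeoutMs) {
        try {
            t.join(timeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return t.isAlive();
    }

    public static void main(String[] args) {
        Thread hook = newHook();
        hook.start();
        // The JVM's own shutdown sequence joins hooks with no timeout; a timeout
        // is used here only so that the demo terminates.
        System.out.println("hook stuck while a connection is open: "
                + stillAliveAfter(hook, 200));
        openConnections.set(0); // simulate the last connection closing
        System.out.println("hook stuck after connections close: "
                + stillAliveAfter(hook, 2000));
    }
}
```

If this reading is right, the thing to look for on the affected nodes is whatever keeps the hook from finishing (for example an RPC connection that never closes), since the JVM cannot complete System.exit until every shutdown hook returns.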