I have a script that ran fine on Hive 13 (YARN), and I'm now experimenting with Tez. When I run a query against a large dataset, I hit the following errors.
0 FATAL [Socket Reader #1 for port 55739] org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Socket Reader #1 for port 55739,5,main] threw an Error. Shutting down now...
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1510)
at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:750)
at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:624)
at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:595)
2015-12-07 20:31:32,859 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
java.lang.OutOfMemoryError: GC overhead limit exceeded
2015-12-07 20:31:30,590 WARN [IPC Server handler 0 on 55739] org.apache.hadoop.ipc.Server: IPC Server handler 0 on 55739, call heartbeat({ containerId=container_1449516549171_0001_01_000100, requestId=10184, startIndex=0, maxEventsToGet=0, taskAttemptId=null, eventCount=0 }), rpc version=2, client version=19, methodsFingerPrint=557389974 from 10.10.30.35:47028 Call#11165 Retry#0: error: java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
at javax.security.auth.SubjectDomainCombiner.optimize(SubjectDomainCombiner.java:464)
at javax.security.auth.SubjectDomainCombiner.combine(SubjectDomainCombiner.java:267)
at java.security.AccessControlContext.goCombiner(AccessControlContext.java:499)
at java.security.AccessControlContext.optimize(AccessControlContext.java:407)
at java.security.AccessController.getContext(AccessController.java:501)
at javax.security.auth.Subject.doAs(Subject.java:412)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
2015-12-07 20:32:53,495 INFO [Thread-60] amazon.emr.metrics.MetricsSaver: Saved 4:3 records to /mnt/var/em/raw/i-782f08c8_20151207_7921_07921_raw.bin
2015-12-07 20:32:53,495 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye..
2015-12-07 20:32:50,435 INFO [IPC Server handler 20 on 55739] org.apache.hadoop.ipc.Server: IPC Server handler 20 on 55739, call getTask(org.apache.tez.common.ContainerContext@409a6aa9), rpc version=2, client version=19, methodsFingerPrint=557389974 from 10.10.30.33:33644 Call#11094 Retry#0: error: java.io.IOException: java.lang.OutOfMemoryError: GC overhead limit exceeded
java.io.IOException: java.lang.OutOfMemoryError: GC overhead limit exceeded
2015-12-07 20:32:29,117 WARN [IPC Server handler 23 on 55739] org.apache.hadoop.ipc.Server: IPC Server handler 23 on 55739, call getTask(org.apache.tez.common.ContainerContext@7c7e6992), rpc version=2, client version=19, methodsFingerPrint=557389974 from 10.10.30.38:44218 Call#11260 Retry#0: error: java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
2015-12-07 20:32:53,497 INFO [Thread-60] amazon.emr.metrics.MetricsSaver: Saved 1:1 records to /mnt/var/em/raw/i-782f08c8_20151207_7921_07921_raw.bin
2015-12-07 20:32:53,498 INFO [Thread-61] amazon.emr.metrics.MetricsSaver: Saved 1:1 records to /mnt/var/em/raw/i-782f08c8_20151207_7921_07921_raw.bin
2015-12-07 20:32:53,498 INFO [Thread-2] org.apache.tez.dag.app.DAGAppMaster: DAGAppMaster received a signal. Signaling TaskScheduler
2015-12-07 20:32:53,498 INFO [Thread-2] org.apache.tez.dag.app.rm.TaskSchedulerEventHandler: TaskScheduler notified that iSignalled was : true
2015-12-07 20:32:53,499 INFO [Thread-2] org.apache.tez.dag.history.HistoryEventHandler: Stopping HistoryEventHandler
2015-12-07 20:32:53,499 INFO [Thread-2] org.apache.tez.dag.history.recovery.RecoveryService: Stopping RecoveryService
2015-12-07 20:32:53,499 INFO [Thread-2] org.apache.tez.dag.history.recovery.RecoveryService: Closing Summary Stream
2015-12-07 20:32:53,499 INFO [LeaseRenewer:hadoop@10.10.30.148:9000] org.apache.hadoop.util.ExitUtil: Halt with status -1 Message: HaltException
Some specs on the EMR cluster: m1.xlarge master node, 4 r3.8xlarge core nodes, 2 r3.8xlarge task nodes (roughly 1.3 TB of memory in total).
I tried the following settings, but they didn't help:
SET tez.task.resource.memory.mb=8000;
SET hive.tez.container.size=30208;
SET hive.tez.java.opts=-Xmx24168m;
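What puzzles me is that the stack traces above come from the DAGAppMaster JVM itself (its Socket Reader, AsyncDispatcher, and IPC handler threads), so I suspect the task-level settings above never touch the JVM that is actually dying. Here is a minimal sketch of the AM-side settings I'm considering, assuming the Tez 0.4.x property names (tez.am.java.opts was renamed in later Tez releases); as far as I understand, the AM reads these at launch, so they'd have to be in tez-site.xml or SET at the top of the script, before the first query starts the Tez session:

-- assumed Tez 0.4.x property names; must take effect before the
-- Tez session/AM is launched to have any effect
SET tez.am.resource.memory.mb=8192;
-- keep -Xmx at roughly 80% of the AM container size
SET tez.am.java.opts=-Xmx6554m;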
Also, Amazon ships Tez 0.4.1 on EMR, which is what I'm running (maybe that's the problem?).
Can anyone help me resolve this? I've tried adjusting a few memory-related properties such as mapreduce.map.memory.mb, but no luck so far.
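For completeness, one sanity check I know of: in the Hive CLI, SET with a property name and no value prints the effective setting for the session, so the overrides can be verified before the query runs:

-- prints name=value for the current session; useful for checking
-- whether the overrides above were actually picked up
SET hive.tez.container.size;
SET tez.am.resource.memory.mb;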