hive - Issue in Hive query due to memory

Question

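We have an insert query in which we are trying to insert data into a partitioned table by reading data from a non-partitioned table.

Query -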

 insert into db1.fact_table PARTITION(part_col1, part_col2) 
 ( col1,
 col2,
 col3,
 col4,
 col5,
 col6,
 .
 .
 .
 .
 .
 .
 .
 col32,
 LOAD_DT,
 part_col1,
 Part_col2 ) 
 select 
 col1,
 col2,
 col3,
 col4,
 col5,
 col6,
 .
 .
 .
 .
 .
 .
 .
 col32,
 part_col1,
 Part_col2
 from db1.main_table WHERE col1=0;

The table has 34 columns. The number of records in the main table depends on the size of the input file we receive daily, and the number of partitions (part_col1, part_col2) we insert in each run may vary from 4000 to 5000.

Sometimes this query fails with the issue below.

2019-04-28 13:23:31,715 Stage-1 map = 95%, reduce = 0%, Cumulative CPU 177220.23 sec
2019-04-28 13:24:25,989 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 163577.82 sec
MapReduce Total cumulative CPU time: 1 days 21 hours 26 minutes 17 seconds 820 msec
Ended Job = job_1556004136988_155295 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1556004136988_155295_m_000004 (and more) from job job_1556004136988_155295
Task with the most failures(4):
-----
Task ID: task_1556004136988_155295_m_000000
-----
Diagnostic Messages for this Task:
Exception from container-launch.
Container id: container_e81_1556004136988_155295_01_000015
Exit code: 255
Stack trace: ExitCodeException exitCode=255:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:563)
at org.apache.hadoop.util.Shell.run(Shell.java:460)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:748)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:305)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:356)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:

Current Hive properties.

Using Tez engine -

set hive.execution.engine=tez;
set hive.tez.container.size=3072;
set hive.tez.java.opts=-Xmx1640m;
set hive.vectorized.execution.enabled=false;
set hive.vectorized.execution.reduce.enabled=false;
set hive.enforce.bucketing=true;
set hive.exec.parallel=true;
set hive.auto.convert.join=false;
set hive.enforce.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.optimize.bucketmapjoin=true;
set hive.exec.tmp.maprfsvolume=false;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.stats.fetch.partition.stats=true;
set hive.support.concurrency=true;
set hive.exec.max.dynamic.partitions=999999999;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on=true;

Based on input from another team, we changed the engine to mr, with these properties -

set hive.execution.engine=mr;
set hive.auto.convert.join=false;
set mapreduce.map.memory.mb=16384;
set mapreduce.map.java.opts=-Xmx14745m;
set mapreduce.reduce.memory.mb=16384;
set mapreduce.reduce.java.opts=-Xmx14745m;

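With these properties the query completed a few times without any errors.

How can I debug these issues, and are there any Hive properties we can set so that we do not run into these issues in the future?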

score 1 · Accepted Answer

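Add DISTRIBUTE BY on the partition keys. All rows for a given partition then go to the same reducer, so each reducer writes only its own partitions rather than every partition. This reduces memory consumption, because each reducer creates fewer files and keeps fewer buffers open.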

insert into db1.fact_table PARTITION(part_col1, part_col2) 
select 
col1,
...

col32,
part_col1,
Part_col2
 from db1.main_table WHERE col1=0
distribute by part_col1, Part_col2; --add this

Use predicate push-down; it may help with filtering if the source files are ORC:

SET hive.optimize.ppd=true;
SET hive.optimize.ppd.storage=true;
SET hive.optimize.index.filter=true;
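Whether these settings can help depends on the storage format of the source table. One way to check, using the table name from the question, is to inspect the table metadata:

```sql
-- Shows storage details for the source table.
-- For an ORC table, the InputFormat line is expected to name
-- org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.
DESCRIBE FORMATTED db1.main_table;
```

If the table is stored as plain text, the storage-level push-down settings above are unlikely to have any effect.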

Tune mapper and reducer parallelism appropriately: https://stackoverflow.com/a/48487306/2700344
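As a rough sketch of what such tuning can look like, the properties below are commonly used to influence the number of mappers and reducers. The values are illustrative assumptions, not recommendations for this workload, and need to be adjusted to the data volume and cluster:

```sql
-- Illustrative values only (assumptions); sizes are in bytes.

-- Mappers on Tez: bounds for split grouping. A smaller max-size gives more mappers.
SET tez.grouping.min-size=16777216;
SET tez.grouping.max-size=134217728;

-- Mappers on MR: maximum input split size. A smaller value gives more mappers.
SET mapreduce.input.fileinputformat.split.maxsize=134217728;

-- Reducers: target input bytes per reducer, and an upper bound on the reducer count.
-- A smaller bytes.per.reducer gives more reducers.
SET hive.exec.reducers.bytes.per.reducer=67108864;
SET hive.exec.reducers.max=2000;
```

More, smaller tasks generally mean less data and fewer open partition writers per container, at the cost of more scheduling overhead.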

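If your data is too big and unevenly distributed by partition key, add a random distribution in addition to the partition key. This helps with skewed data: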

distribute by part_col1, Part_col2, FLOOR(RAND()*100.0)%20;
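To judge whether the random component is needed, it may help to first look at how rows are spread over the partition keys. A minimal sketch, using the table, columns, and filter from the question:

```sql
-- Row counts per target partition, largest first.
SELECT part_col1, part_col2, COUNT(*) AS row_cnt
FROM db1.main_table
WHERE col1 = 0
GROUP BY part_col1, part_col2
ORDER BY row_cnt DESC
LIMIT 20;
```

The expression FLOOR(RAND()*100.0)%20 yields values from 0 to 19, so the rows of each partition can be spread across up to 20 reducers. The trade-off is that each partition may then end up with up to 20 output files instead of one.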

Also read https://stackoverflow.com/a/55375261/2700344
