azure - OutOfMemory exception - Hive multi-join query in an HDInsight LLAP cluster

Question

I am running a Hive multi-join query on an Azure HDInsight LLAP Hive cluster.

It throws an OutOfMemory exception after running for about 20 minutes.

Query:

CREATE TABLE tt AS
SELECT given_qad_sedol as Sedol7, f.ws_cd, f.ws_id, f.cntry_cd, f.cntry_name, f.entity_name,
  f.stmt_sub_typ, f.stmt_sub_typ_desc, f.stmt_typ, f.stmt_typ_desc,
  f.item, f.item_name, f.short_mnem, f.item_mnem,
  coalesce(f1.frq, f.frq) as frq, coalesce(f1.frq_desc, f.frq_desc) as frq_desc,
  f.yr, f.seq, f.fiscal_per_end_date,
  coalesce(f1.erng_rpt_date, f.erng_rpt_date) as erng_rpt_date,
  f.per_update_flg, f.per_update_desc, f.per_srce, f.reported_curr,
  coalesce(f1.reported_val, f.reported_val) as reported_val,
  f.exch_rate, f.ws_curr, f.unit_typ
FROM imdl_irdp_dev.cur_std_fundamentals f
JOIN imdl_irdp_dev.cur_ws_comp_map cm ON f.ws_cd = cm.ws_cd
JOIN imdl_irdp_dev.cur_scrty_sedol_chg_hstry s ON cm.qad_scrty_cd = s.qad_scrty_cd AND cm.typ = s.typ
LEFT JOIN imdl_irdp_dev.cur_std_fundamentals f1 ON f.ws_cd = f1.ws_cd AND f.item = f1.item AND f.yr = f1.yr AND f.seq = f1.seq AND f1.frq = 'B'
ORDER BY yr, seq, stmt_typ_desc, item;

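The highlighted table has about 1.5 billion records. We cannot change the query because it is a business requirement, but we can optimize it as long as the query result does not change.

I also tried the following options, but still no luck.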

set mapreduce.map.memory.mb=8000;
set mapreduce.map.java.opts=-Xmx46080m;
set mapreduce.reduce.memory.mb=8000;
set mapreduce.reduce.java.opts=-Xmx7000m;
set hive.tez.container.size=8000;
set hive.tez.java.opts=-Xmx7000m;
set hive.auto.convert.join.noconditionaltask.size=1000000000;
set dfs.blocksize=1073741824;

Is there any way to optimize this query?

score 0 · Accepted Answer

It worked well after reordering the joins.

select 'B07C796' as Sedol7, f2.ws_cd, f2.cntry_name, f2.entity_name, f2.stmt_sub_typ_desc, f2.stmt_typ, f2.item, f2.item_mnem,
  coalesce(f3.frq, f2.frq) as frq, coalesce(f3.frq_desc, f2.frq_desc) as frq_desc,
  f2.yr, f2.seq, f2.fiscal_per_end_date,
  coalesce(f3.erng_rpt_date, f2.erng_rpt_date) as erng_rpt_date,
  f2.per_update_flg, f2.per_update_desc, f2.per_srce, f2.reported_curr,
  coalesce(f3.reported_val, f2.reported_val) as reported_val,
  f2.exch_rate, f2.ws_curr, f2.unit_typ, f2.ws_val
from (
  select f.ws_cd, f.ws_id, f.cntry_cd, f.cntry_name, f.entity_name, f.stmt_sub_typ, f.stmt_sub_typ_desc, f.stmt_typ, f.stmt_typ_desc,
    f.item, f.item_name, f.short_mnem, f.item_mnem, f.frq, f.frq_desc, f.yr, f.seq,
    f.fiscal_per_end_date, f.erng_rpt_date, f.per_update_flg, f.per_update_desc, f.per_srce,
    f.reported_curr, f.reported_val, f.exch_rate, f.ws_curr, f.unit_typ, f.ws_val
  from (
    select comp_map.ws_cd
    from (select qad_scrty_cd, typ from imdl_irdp_dev.cur_scrty_sedol_chg_hstry where given_qad_sedol in (substr('B07C796',0,6))) chg_hstry
    join (select ws_cd, typ, qad_scrty_cd from imdl_irdp_dev.cur_ws_comp_map) comp_map
      ON chg_hstry.qad_scrty_cd = comp_map.qad_scrty_cd AND chg_hstry.typ = comp_map.typ
  ) f1
  join (
    select ws_cd, ws_id, cntry_cd, cntry_name, entity_name, stmt_sub_typ, stmt_sub_typ_desc, stmt_typ, stmt_typ_desc,
      item, item_name, short_mnem, item_mnem, frq, frq_desc, yr, seq,
      fiscal_per_end_date, erng_rpt_date, per_update_flg, per_update_desc, per_srce,
      reported_curr, reported_val, exch_rate, ws_curr, unit_typ, ws_val
    from imdl_irdp_dev.cur_std_fundamentals_part
    where yr BETWEEN ... AND 2018 AND frq='A' AND stmt_typ in ('IS','BS','CF','Other')
  ) f ON f.ws_cd = f1.ws_cd
) f2
left join (
  select frq, frq_desc, erng_rpt_date, reported_val, ws_cd, item, yr, seq
  from imdl_irdp_dev.cur_std_fundamentals_part
  where yr BETWEEN 2018 AND 2018 AND frq='B' AND stmt_typ in ('IS','BS','CF','Other')
) f3 ON f2.ws_cd = f3.ws_cd AND f2.item = f3.item AND f2.yr = f3.yr AND f2.seq = f3.seq
order by f2.yr, f2.seq, f2.item;
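If it helps to confirm which join order Hive actually picks, a rough check is to gather column statistics on the small lookup tables and look at the plan with EXPLAIN. This is only a sketch: it assumes the cost-based optimizer is available on the cluster, and the short SELECT is an illustrative stand-in for the real query, not a replacement for it.

```sql
-- Assumption: CBO is available; these settings ask Hive to use column stats.
SET hive.cbo.enable=true;
SET hive.stats.fetch.column.stats=true;

-- Gather column statistics on the two small lookup tables.
ANALYZE TABLE imdl_irdp_dev.cur_ws_comp_map COMPUTE STATISTICS FOR COLUMNS;
ANALYZE TABLE imdl_irdp_dev.cur_scrty_sedol_chg_hstry COMPUTE STATISTICS FOR COLUMNS;

-- Print the plan only (nothing is executed) to see the chosen join order
-- and whether the small tables are turned into map joins.
EXPLAIN
SELECT f.ws_cd, f.item, f.yr, f.seq
FROM imdl_irdp_dev.cur_scrty_sedol_chg_hstry s
JOIN imdl_irdp_dev.cur_ws_comp_map cm
  ON cm.qad_scrty_cd = s.qad_scrty_cd AND cm.typ = s.typ
JOIN imdl_irdp_dev.cur_std_fundamentals f
  ON f.ws_cd = cm.ws_cd;
```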

score 0 · Accepted Answer

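There are two best practices you can follow. Introduce partitioning so that only the necessary data is read, and change the file format to ORC, since you are selecting only a few columns. This reduces the amount of data loaded and makes execution faster. How many records remain when the filter (f1.frq = 'B') is applied to cur_std_fundamentals? Whether to partition on it depends on the data distribution.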
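A minimal sketch of both suggestions, under stated assumptions: the table name cur_std_fundamentals_orc is hypothetical, the column types are guesses, only a handful of columns are shown, and partitioning by frq is just one candidate that only makes sense if the counts in the last query are reasonably balanced.

```sql
-- Hypothetical ORC copy of the fact table, partitioned by frq.
-- Column types are assumptions; the real table has many more columns.
CREATE TABLE imdl_irdp_dev.cur_std_fundamentals_orc (
  ws_cd        STRING,
  item         STRING,
  yr           INT,
  seq          INT,
  reported_val DOUBLE
)
PARTITIONED BY (frq STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');

-- Let Hive derive the partition value from the last SELECT column.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE imdl_irdp_dev.cur_std_fundamentals_orc PARTITION (frq)
SELECT ws_cd, item, yr, seq, reported_val, frq
FROM imdl_irdp_dev.cur_std_fundamentals;

-- Check the data distribution before committing to frq as the partition key.
SELECT frq, COUNT(*) AS cnt
FROM imdl_irdp_dev.cur_std_fundamentals
GROUP BY frq;
```

With a layout like this, a predicate such as frq = 'B' lets Hive read only the matching partition, and ORC lets it skip the columns the query does not select.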

You can also break the query up by running the self-join first, and see how it performs.
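One way this split could look, as a sketch only: the intermediate table name tmp_fundamentals_merged is hypothetical and the column list is shortened for readability. Because the LEFT JOIN condition in the question refers only to f and f1, materializing it before the two inner joins should not change the final result.

```sql
-- Step 1: materialize the self-join on its own (hypothetical table name;
-- only a subset of the original columns is listed here).
CREATE TABLE imdl_irdp_dev.tmp_fundamentals_merged STORED AS ORC AS
SELECT f.ws_cd, f.item, f.yr, f.seq, f.stmt_typ_desc,
       COALESCE(f1.frq, f.frq)                   AS frq,
       COALESCE(f1.reported_val, f.reported_val) AS reported_val
FROM imdl_irdp_dev.cur_std_fundamentals f
LEFT JOIN imdl_irdp_dev.cur_std_fundamentals f1
  ON f.ws_cd = f1.ws_cd AND f.item = f1.item
 AND f.yr = f1.yr AND f.seq = f1.seq AND f1.frq = 'B';

-- Step 2: join the materialized result to the two lookup tables.
CREATE TABLE tt AS
SELECT s.given_qad_sedol AS Sedol7, m.*
FROM imdl_irdp_dev.tmp_fundamentals_merged m
JOIN imdl_irdp_dev.cur_ws_comp_map cm
  ON m.ws_cd = cm.ws_cd
JOIN imdl_irdp_dev.cur_scrty_sedol_chg_hstry s
  ON cm.qad_scrty_cd = s.qad_scrty_cd AND cm.typ = s.typ
ORDER BY m.yr, m.seq, m.stmt_typ_desc, m.item;
```

Running the two steps separately also shows which of them is the one that runs out of memory.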

Are you using any compression?
