hive - Filling in missing data with Hive/Pig

Question

I have a Hive table with the following structure:

id1, id2, year, value 
1, 1, 2000, 20
1, 1, 2002, 23
1, 1, 2003, 24
1, 2, 1999, 34
1, 2, 2000, 35
1, 2, 2001, 37
2, 3, 2005, 50
2, 3, 2006, 56
2, 3, 2008, 60

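Two ids together identify a "user", and for each user and year there is a value. Some years have no value and do not appear in the table at all. I would like every [id1, id2] pair to have a row for each year between its minimum and maximum year; if a year is missing, it should take the previous year's value. So the table should become: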

id1, id2, year, value 
1, 1, 2000, 20
1, 1, 2001, 20
1, 1, 2002, 23
1, 1, 2003, 24
1, 2, 1999, 34
1, 2, 2000, 35
1, 2, 2001, 37
2, 3, 2005, 50
2, 3, 2006, 56
2, 3, 2007, 56
2, 3, 2008, 60

I need to do this in Hive or Pig; in the worst case I could use Spark.

Thanks,

score 0 · Accepted Answer

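I would do this using a temporary table. The year range differs for each id1, id2 pair, so I generate a separate series of years per pair rather than one series covering all years.

1) Get the minimum and maximum year for each id1, id2. Call this the series_dtes table.
2) Left join it to the table at hand (which I call cal_date).
3) Create a temporary table from the combined series_dtes and cal_date tables. This fills in the missing years for each id1, id2, such as 2001 and 2007.
4) Use the lag function to fill in the missing values for 2001 and 2007.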
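The per-pair series of years is generated in the query below with posexplode over a string of spaces. A minimal sketch of that idiom in isolation, assuming a hypothetical literal gap of 3 years (the names n, pe, idx and unused are illustrative only):

```sql
-- space(3) is a string of 3 spaces; splitting it on ' ' gives 4 empty strings,
-- and posexplode emits one row per element together with its position.
select pe.idx
from (select 3 as n) t
lateral view posexplode(split(space(t.n), ' ')) pe as idx, unused;
-- idx: 0, 1, 2, 3
```

Adding idx to a pair's minimum year therefore yields every year up to its maximum.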

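-- Build one row per (id1, id2, year) between each pair's min and max year,
-- then left join the original table (cal_date) to pick up the existing values.
create table tmp as
with series_dtes as (
  select id1, id2, (t.min_dt + pe.idx) as series_year
  from (select id1, id2, min(year) as min_dt, max(year) as max_dt
        from cal_date
        group by id1, id2) t
  -- space(n) split on ' ' yields n+1 empty strings, so idx runs from 0 to n
  lateral view posexplode(split(space(t.max_dt - t.min_dt), ' ')) pe as idx, dte
)
select dte.id1, dte.id2, dte.series_year, t.value
from series_dtes dte
left join cal_date t
  on dte.series_year = t.year and t.id1 = dte.id1 and t.id2 = dte.id2
order by dte.id1, dte.id2, dte.series_year;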

-- lag() looks back exactly one row, so this fills gaps of a single year.
select id1, id2, series_year as year,
  (case when value is null then lag(value) over (partition by id1, id2 order by series_year) else value end) as value
from tmp;

Result:
id1     id2     year    value
1       1       2000    20
1       1       2001    20
1       1       2002    23
1       1       2003    24
1       2       1999    34
1       2       2000    35
1       2       2001    37
2       3       2005    50
2       3       2006    56
2       3       2007    56
2       3       2008    60
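
The lag() call looks back exactly one row, so a gap of two or more consecutive years would still leave nulls after the first filled year. A possible variant, offered as a sketch rather than the answer's own method: Hive's last_value takes an optional second boolean argument that skips nulls, which lets the most recent non-null value carry forward across longer gaps. It reuses the tmp table and column names from the query above.

```sql
-- Carry the most recent non-null value forward within each (id1, id2) pair.
select id1, id2, series_year as year,
       last_value(value, true) over (
         partition by id1, id2
         order by series_year
         rows between unbounded preceding and current row
       ) as value
from tmp;
```

On the sample data this gives the same result as above, since no gap is longer than one year.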

score 0 · Accepted Answer

This is easiest to do if the years can be stored in a table.

-- include as many years as needed
create table dbname.years
location 'hdfs_location' as
select 2000 as yr union all select 2001 as yr;

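1) With this table, the ids can be cross joined to it to generate every id/year combination, and the result is then left joined to the original table.

2) The rows are then grouped so that the null values from the previous step (years missing for an id in the original table) are assigned to the same group as the preceding non-null value. This is done with a running sum. Run the subquery to see how the groups are assigned.

3) After that, select the max for each id1, id2, group combination.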

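-- tbl is the original table; yr and val are its year and value columns.
select id1, id2, yr, max(val) over (partition by id1, id2, grp) as val
from (select i.id1, i.id2, y.yr, t.val,
             -- running count of non-null values: a null row stays in the group of the last non-null row
             sum(case when t.val is null then 0 else 1 end)
               over (partition by i.id1, i.id2 order by y.yr) as grp
      from (select distinct id1, id2 from tbl) i
      -- every (id1, id2) pair combined with every year in the years table
      cross join (select yr from years) y
      left join tbl t on i.id1 = t.id1 and i.id2 = t.id2 and y.yr = t.yr
     ) t;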
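
Step 2 suggests running the subquery to see how the groups are assigned. A sketch of that, using the same tbl and years names as the query above and, as an assumption made only for readability, restricted to a single id pair:

```sql
-- grp increases by 1 at every non-null val, so a null row keeps the
-- group number of the closest earlier non-null row.
select i.id1, i.id2, y.yr, t.val,
       sum(case when t.val is null then 0 else 1 end)
         over (partition by i.id1, i.id2 order by y.yr) as grp
from (select distinct id1, id2 from tbl where id1 = 1 and id2 = 1) i
cross join (select yr from years) y
left join tbl t on i.id1 = t.id1 and i.id2 = t.id2 and y.yr = t.yr;
```

With the question's data and a years table holding 2000 through 2003, the rows come out as (2000, 20, grp 1), (2001, null, grp 1), (2002, 23, grp 2), (2003, 24, grp 3). Taking max(val) over each group then fills 2001 with 20.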
