我有如下数据
1,ref1,1200,USD,CR
2,ref1,1200,USD,DR
3,ref2,2100,USD,DR
4,ref2,800,USD,CR
5,ref2,700,USD,CR
6,ref2,600,USD,CR
我想将这些记录分组,其中 field2 匹配,sum(field3) 匹配并且 field5 是相反的(意味着如果 lhs 是“CR”那么 rhs 应该是“DR”,反之亦然)
如何使用 pig 脚本实现这一点?
我有如下数据
1,ref1,1200,USD,CR
2,ref1,1200,USD,DR
3,ref2,2100,USD,DR
4,ref2,800,USD,CR
5,ref2,700,USD,CR
6,ref2,600,USD,CR
我想将这些记录分组,其中 field2 匹配,sum(field3) 匹配并且 field5 是相反的(意味着如果 lhs 是“CR”那么 rhs 应该是“DR”,反之亦然)
如何使用 pig 脚本实现这一点?
我不确定我是否理解您的要求,但您可以加载数据,分成两组(过滤器/拆分)和 cogroup,例如:
data = load ... as (field1: int, field2: chararray, field3: int, field4: chararray, field5: chararray);
crs= filter data by field5='CR';
crs_grp = group crs by field1;
crs_agg = foreach crs_grp generate group.field1 as field1, sum(crs.field3);
drs = filter data by field5='DR';
drs_grp = group drs by field1;
drs_agg = foreach drs_grp generate group.field1 as field1, sum(drs.field3);
g = COGROUP crs_agg BY (field1, field3), drs_agg BY (field1, field3);
你也可以这样做:
data = LOAD 'myData' USING PigStorage(',') AS
(field1: int, field2: chararray,
field3: int, field4: chararray,
field5: chararray) ;
B = FOREACH (GROUP data BY (field2, field5)) GENERATE group.field2, data ;
-- Since in B there will always be two sets of field2 (one for CR and one for DR)
-- grouping by field2 again will get all pairs of CR and DR
-- (where the sums are equal of course)
C = GROUP B BY (field2, SUM(field3)) ;
最后一步的模式和输出:
C: {group: (field2: chararray,long),B: {(field2: chararray,data: {(field1: int,field2: chararray,field3: int,field4: chararray,field5: chararray)})}}
((ref1,1200),{(ref1,{(1,ref1,1200,USD,CR)}),(ref1,{(2,ref1,1200,USD,DR)})})
((ref2,2100),{(ref2,{(4,ref2,800,USD,CR),(5,ref2,700,USD,CR),(6,ref2,600,USD,CR)}),(ref2,{(3,ref2,2100,USD,DR)})})
输出 put 现在有点笨拙,但这会清除它:
-- Make sure to look at the schema for C above
D = FOREACH C {
-- B is a bag containing tuples in the form: B: {(field2, data)}
-- What we want is to just extract out the data field and nothing else
-- so we can go over each tuple in the bag and pull out
-- the second element (the data we want).
justbag = FOREACH B GENERATE FLATTEN($1) ;
-- Without FLATTEN the schema for justbag would be:
-- justbag: {(data: (field1, ...))}
-- FLATTEN makes it easier to access the fields by removing data:
-- justbag: {(field1, ...)}
GENERATE justbag ;
}
进入这个:
D: {justbag: {(data::field1: int,data::field2: chararray,data::field3: int,data::field4: chararray,data::field5: chararray)}}
({(1,ref1,1200,USD,CR),(2,ref1,1200,USD,DR)})
({(4,ref2,800,USD,CR),(5,ref2,700,USD,CR),(6,ref2,600,USD,CR),(3,ref2,2100,USD,DR)})