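In Pig Latin, I want to group twice so that I can select rows that satisfy two different criteria.
I find the problem hard to explain, so here is an example. Suppose that, for each address, I want the details of the person whose age is closest to mine ($my_age) and who has the most money.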
Relation A has five columns: (name, address, zipcode, age, money)
-- compute the gap to my age first, because MIN needs a column of the grouped bag
A1 = FOREACH A GENERATE name, address, zipcode, age, money, ABS($my_age - age) AS age_gap;
B = GROUP A1 BY (address, zipcode); -- group by the address
-- generate the address, the smallest age gap, and the original rows
C = FOREACH B GENERATE group, MIN(A1.age_gap) AS min_gap, FLATTEN(A1);
D = FILTER C BY min_gap == age_gap;
-- Then group again to select the richest; this GROUP BY fails:
E = GROUP D BY group; -- or: E = GROUP D BY (address, zipcode);
-- If the grouping worked, the rest would be:
F = FOREACH E GENERATE group, MAX(D.money) AS max_money, FLATTEN(D);
G = FILTER F BY max_money == money;
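I tried filtering on the closest age and the most money at the same time, but that does not work, because the richest person can be the one whose age is furthest from mine.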
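One possible way to express this two-step selection without a second GROUP is a nested FOREACH that sorts each address group by age gap, then by money, and keeps the first row. This is only a sketch under assumptions: the input path 'people.csv', the comma delimiter, the field types, and the aliases people, with_gap, by_address, and closest_richest are all hypothetical, and $my_age is expected to be supplied with -param.

```pig
-- Assumed input: comma-separated file holding the five columns of relation A.
people = LOAD 'people.csv' USING PigStorage(',')
         AS (name:chararray, address:chararray, zipcode:chararray, age:int, money:double);

-- Gap between each person's age and mine ($my_age is passed as -param my_age=...).
with_gap = FOREACH people GENERATE name, address, zipcode, age, money,
                                   ABS($my_age - age) AS age_gap;

by_address = GROUP with_gap BY (address, zipcode);

-- Inside each group: closest age first, richest first among equal gaps.
closest_richest = FOREACH by_address {
    ranked = ORDER with_gap BY age_gap ASC, money DESC;
    best   = LIMIT ranked 1;
    GENERATE FLATTEN(best);
};

DUMP closest_richest;
```

Note that LIMIT 1 keeps a single row per address, so if two people tie on both age gap and money, only one of them is returned, unlike the FILTER-based attempt, which would keep both.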
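Another, more realistic example:
You have a demand file with the fields: iddem, idopedem, datedem
You have an operation file with the fields: idope, labelope, dateope, idoftheday, infope
For each demand, I want to return the operation that satisfies the following rules:
dateope must be the closest to datedem.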
If datedem - dateope >= 0, I must select the operation with max(idoftheday); otherwise I must select the operation with min(idoftheday).
Relation A has 5 columns: (idope, labelope, dateope, idoftheday, infope)
Relation B has 3 columns: (iddem, idopedem, datedem)
C = JOIN A BY idope, B BY idopedem;
D = FOREACH C GENERATE iddem, idope, labelope, datedem, dateope, ABS(datedem - dateope) AS datedelta, idoftheday, infope;
E = GROUP D BY iddem;
F = FOREACH E GENERATE group, MIN(D.datedelta) AS deltamin, FLATTEN(D);
G = FILTER F BY deltamin == datedelta;
-- Then I must group a second time to select the min or max idoftheday
H = GROUP G BY group; -- fails on DUMP
H = GROUP G BY iddem; -- fails on DUMP
-- If the grouping worked, the rest would be:
I = FOREACH H GENERATE group, MAX(G.idoftheday) AS idmax, MIN(G.idoftheday) AS idmin, FLATTEN(G);
J = FILTER I BY idoftheday == (datedem - dateope >= 0 ? idmax : idmin);
DUMP J;
Data for the second example (note that the dates are already Unix timestamps). The demand file looks like this:
1, 'ctr1', 1359460800000
2, 'ctr2', 1354363200000
The operation file (idope, labelope, dateope, idoftheday, infope) looks like this:
'ctr0','toto',1359460800000,1,'blabla0'
'ctr0','tata',1359460800000,2,'blabla1'
'ctr1','toto',1359460800000,1,'blabla2'
'ctr1','tata',1359460800000,2,'blabla3'
'ctr2','toto',1359460800000,1,'blabla4'
'ctr2','tata',1359460800000,2,'blabla5'
'ctr3','toto',1359460800000,1,'blabla6'
'ctr3','tata',1359460800000,2,'blabla7'
The result must look like this:
1, 'ctr1', 'tata',1359460800000,2,'blabla3'
2, 'ctr2', 'toto',1359460800000,1,'blabla4'
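For the second example, the same nested FOREACH idea can encode both rules in one sort. What follows is a sketch, not a verified solution: the file names 'operations.csv' and 'demands.csv', the field types, the helper column pick_key, and the aliases ops, dem, joined, scored, by_demand, selected, and result are hypothetical. It also assumes the files are stored as plain comma-separated values without the quotes and extra spaces shown in the listing above, since PigStorage would not strip them.

```pig
-- Assumed inputs: comma-separated files, strings unquoted, no padding spaces.
ops = LOAD 'operations.csv' USING PigStorage(',')
      AS (idope:chararray, labelope:chararray, dateope:long, idoftheday:int, infope:chararray);
dem = LOAD 'demands.csv' USING PigStorage(',')
      AS (iddem:int, idopedem:chararray, datedem:long);

joined = JOIN dem BY idopedem, ops BY idope;

-- pick_key negates idoftheday when the max is wanted, so that an
-- ascending sort always puts the wanted row first.
scored = FOREACH joined GENERATE
    dem::iddem AS iddem, ops::idope AS idope, ops::labelope AS labelope,
    ops::dateope AS dateope, ops::idoftheday AS idoftheday, ops::infope AS infope,
    ABS(dem::datedem - ops::dateope) AS datedelta,
    ((dem::datedem - ops::dateope) >= 0L ? ops::idoftheday * -1 : ops::idoftheday) AS pick_key;

by_demand = GROUP scored BY iddem;

-- Closest date first, then the idoftheday rule via pick_key; keep one row per demand.
selected = FOREACH by_demand {
    ranked = ORDER scored BY datedelta ASC, pick_key ASC;
    best   = LIMIT ranked 1;
    GENERATE FLATTEN(best);
};

-- Drop the helper columns so the output has the requested shape.
result = FOREACH selected GENERATE iddem, idope, labelope, dateope, idoftheday, infope;

DUMP result;
```

Under those assumptions, the sort puts idoftheday 2 first for demand 1 (date difference of 0, so the max is wanted) and idoftheday 1 first for demand 2 (negative difference, so the min is wanted), which is in line with the expected result. If two operations are equally close to a demand but lie on opposite sides of datedem, the rule itself is ambiguous and this sketch simply keeps whichever row sorts first.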