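I'm trying to select multiple records from a group that share the same key, but I'm not sure how to filter for them.

For example, given the following data: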

D1,20130701,M1,V1

D1,20130701,M2,V2

D1,20130702,M1,V3

D1,20130703,M1,V4

D1,20130703,M2,V5

D2,20130701,M1,V1

D2,20130702,M1,V3

D2,20130703,M1,V4

and a load statement:

A = load '/home/hduser/t.csv' 
        using PigStorage(',') 
        as (
            device:chararray, 
            dt:chararray, 
            metric:chararray, 
            value:chararray
        );

C = group A by (device, dt);

this produces:

((D1,20130701),{(D1,20130701,M1,V1),(D1,20130701,M2,V2)})

((D1,20130702),{(D1,20130702,M1,V3)})

((D1,20130703),{(D1,20130703,M1,V4),(D1,20130703,M2,V5)})

((D2,20130701),{(D2,20130701,M1,V1)})

((D2,20130702),{(D2,20130702,M1,V3)})

((D2,20130703),{(D2,20130703,M1,V4)})

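The question is: how do I filter this so that, for each device (D1, D2, ...), I only get the rows with the lowest date? In the output above, those would be the (D1,20130701) and (D2,20130701) lines.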

If I group by device only:

B = group A by device;

I get the following two rows:

(D1,{(D1,20130701,M1,V1),(D1,20130701,M2,V2),(D1,20130702,M1,V3),(D1,20130703,M1,V4),(D1,20130703,M2,V5)})

(D2,{(D2,20130701,M1,V1),(D2,20130702,M1,V3),(D2,20130703,M1,V4)})

But I can't then use a limit in the foreach, because the number of records per device is variable.

Any ideas? I'm fairly new to Pig!

Many thanks.

1 Answer

One way to do it:

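 -- dt is loaded as int so that MIN compares the dates numerically;
 -- with no USING clause, LOAD expects tab-separated fields
 records = LOAD '/user/nubes/ncdc/micro-tab/top.txt' AS (
        device:chararray, 
        dt:int, 
        metric:chararray, 
        value:chararray);

 -- one bag of rows per device
 records_group = group records by (device);

 -- append the device's minimum dt to each of its rows (it becomes $4)
 with_min = FOREACH records_group 
       GENERATE
       FLATTEN(records), MIN(records.dt) ;

 -- keep only the rows whose own dt ($1) equals that minimum ($4)
 filterRecords = filter with_min by ( $1 == $4 );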

The input is:

D1 20130701 M1 V1

D1 20130701 M2 V2

D1 20130702 M1 V3

D1 20130703 M1 V4

D1 20130703 M2 V5

D2 20130702 M1 V3

D2 20130703 M1 V4

The output is:

(D1,20130701,M1,V1,20130701)

(D1,20130701,M2,V2,20130701)

(D2,20130702,M1,V3,20130702)
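
The same idea can also be written with field names instead of positional references, which may be easier to read. The following is only a sketch, assuming the comma-separated file and path from the question; the aliases `A` to `E` and the helper column `min_dt` are names chosen here for illustration. It also projects the helper column away, so the result has the same four columns as the input.

```pig
-- Load the comma-separated file from the question.
-- dt is read as int (an assumption taken from the approach above) so MIN compares numerically.
A = LOAD '/home/hduser/t.csv' USING PigStorage(',')
    AS (device:chararray, dt:int, metric:chararray, value:chararray);

-- One bag of rows per device.
B = GROUP A BY device;

-- Pair every row with the smallest dt seen for its device.
C = FOREACH B GENERATE FLATTEN(A), MIN(A.dt) AS min_dt;

-- Keep only the rows whose date equals that minimum.
D = FILTER C BY A::dt == min_dt;

-- Drop the helper column and restore the short field names.
E = FOREACH D GENERATE A::device AS device, A::dt AS dt,
                       A::metric AS metric, A::value AS value;

DUMP E;
```

With the data from the question, this should keep both D1 rows dated 20130701 and the single D2 row dated 20130701. Because the filter compares against the minimum rather than using LIMIT, it does not matter how many rows share the lowest date. The order of the dumped rows is not guaranteed.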

answered 2013-07-14T01:06:29.357