Is there a way to create a custom surrogate key in Pig?

For example, suppose we have the following data:

Salary City Name

20000 newyork john   
30000 sydney joseph   
60000 delhi mike   
30000 sydney joseph

For this data, we need to create surrogate keys so that the result looks like this:

     Salary City Name

SCN1 20000 newyork john    
SCN2 30000 sydney joseph   
SCN3 60000 delhi mike  
SCN2 30000 sydney joseph

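That is, identical rows should share the same key, rather than each row getting a random unique key. Is that possible?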

Thanks in advance!

2 Answers

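First take the distinct rows, then use RANK and CONCAT to build a custom key for each distinct row. Next, join the distinct set back to the original dataset. Finally, generate the required columns.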

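-- Load the raw rows; '\t' assumes tab-delimited input
A = LOAD 'data.txt' USING PigStorage('\t');
-- Keep one copy of each distinct row
B = DISTINCT A;
-- Prepend a sequential number ($0) to each distinct row
C = RANK B;
-- Build the key; the explicit cast is needed because RANK yields a long
D = FOREACH C GENERATE CONCAT('SCN',(chararray)$0),$1,$2,$3;
-- Join the original rows to the keyed distinct rows on all three columns
E = JOIN A BY ($0,$1,$2),D BY ($1,$2,$3);
-- In E, $0..$2 come from A and $3 is the key from D
F = FOREACH E GENERATE $3,$0,$1,$2;
DUMP F;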

Here is how it works on the sample data:

A

20000 newyork john   
30000 sydney joseph   
60000 delhi mike   
30000 sydney joseph

B

20000 newyork john   
30000 sydney joseph   
60000 delhi mike

C

1 20000 newyork john   
2 30000 sydney joseph   
3 60000 delhi mike

D

SCN1 20000 newyork john   
SCN2 30000 sydney joseph   
SCN3 60000 delhi mike

E

20000 newyork john SCN1 20000 newyork john     
30000 sydney joseph SCN2 30000 sydney joseph   
60000 delhi mike SCN3 60000 delhi mike 
30000 sydney joseph SCN2 30000 sydney joseph 

F

SCN1 20000 newyork john    
SCN2 30000 sydney joseph   
SCN3 60000 delhi mike  
SCN2 30000 sydney joseph
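
As a hedged variant of the same pipeline, the script can be written with a declared schema so that fields are referenced by name instead of by position. This is only a sketch: the delimiter, the field names `salary`, `city`, and `name`, and the alias `skey` are assumptions for illustration. It also relies on Pig naming the field added by RANK as `rank_<alias>` (here `rank_B`) and casts that value explicitly before concatenating.

```pig
-- Assumed schema and delimiter; adjust both to match the real file
A = LOAD 'data.txt' USING PigStorage('\t')
    AS (salary:int, city:chararray, name:chararray);

-- One copy of each distinct row
B = DISTINCT A;

-- RANK prepends a long field, named rank_B by convention
C = RANK B;

-- Cast the rank to chararray before prefixing it with 'SCN'
D = FOREACH C GENERATE CONCAT('SCN', (chararray)rank_B) AS skey,
                       salary, city, name;

-- Join every original row to its keyed distinct row
E = JOIN A BY (salary, city, name), D BY (salary, city, name);

-- Disambiguate fields with the A:: and D:: prefixes after the join
F = FOREACH E GENERATE D::skey AS skey, A::salary AS salary,
                       A::city AS city, A::name AS name;
DUMP F;
```

Named fields make the final FOREACH easier to read than `$3,$0,$1,$2`, at the cost of fixing the schema up front.
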
Answered 2016-04-20T19:11:47.817

Thanks to Inquistive Mind for helping me generate unique surrogate keys. Here is the Pig script that I tested and that works well.

 A = LOAD '/user/root5/data3.txt' USING PigStorage(',');
 B = DISTINCT A;
 C = RANK B;
 D = FOREACH C GENERATE CONCAT('SCN',$0),$1,$2,$3;
 E = JOIN A BY ($0,$1,$2),D BY ($1,$2,$3);
 F = FOREACH E GENERATE $3, $0, $1, $2;
 DUMP F;

The output of each step is as follows:

DUMP A;
(20000,newyork,john)
(30000,sydney,joseph)
(60000,delhi,mike)
(20000,newyork,john)
(30000,sydney,mike)
(60000,delhi,mike)  

DUMP B;
(20000,newyork,john)
(30000,sydney,mike)
(30000,sydney,joseph)
(60000,delhi,mike)

DUMP C;
(1,20000,newyork,john)
(2,30000,sydney,mike)
(3,30000,sydney,joseph)
(4,60000,delhi,mike)

DUMP D;
(SCN1,20000,newyork,john)
(SCN2,30000,sydney,mike)
(SCN3,30000,sydney,joseph)
(SCN4,60000,delhi,mike)

DUMP E;
(20000,newyork,john,SCN1,20000,newyork,john)
(20000,newyork,john,SCN1,20000,newyork,john)
(30000,sydney,mike,SCN2,30000,sydney,mike)
(30000,sydney,joseph,SCN3,30000,sydney,joseph)
(60000,delhi,mike,SCN4,60000,delhi,mike)
(60000,delhi,mike,SCN4,60000,delhi,mike)

DUMP F;
(SCN1,20000,newyork,john)
(SCN1,20000,newyork,john)
(SCN2,30000,sydney,mike)
(SCN3,30000,sydney,joseph)
(SCN4,60000,delhi,mike)
(SCN4,60000,delhi,mike)
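
In the script above, the number each row receives follows the order in which DISTINCT happens to emit the rows, and Pig relations are generally unordered. If a reproducible numbering matters, one possible approach is to rank by all three columns. The following is a sketch under assumptions, not a tested replacement: the schema and the alias `skey` are hypothetical, and `rank_B` relies on Pig's `rank_<alias>` naming for the field that RANK adds.

```pig
-- Same input as above, with an assumed schema
A = LOAD '/user/root5/data3.txt' USING PigStorage(',')
    AS (salary:int, city:chararray, name:chararray);

B = DISTINCT A;

-- Ranking by every column ties the numbering to the sort order of the rows;
-- since the rows in B are distinct, no two rows share a rank
C = RANK B BY salary ASC, city ASC, name ASC;

-- Explicit cast of the long rank before concatenation
D = FOREACH C GENERATE CONCAT('SCN', (chararray)rank_B) AS skey,
                       salary, city, name;
DUMP D;
```

The keys can still shift if new distinct rows are added later, because the ranks are recomputed over the whole set on every run.
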
Answered 2016-04-28T05:19:50.127