Say I have a structure
{1001, {{id=1001, count=20, key=a}, {id=1001, count=30, key=b}}}
{1002, {{id=1002, count=40, key=a}, {id=1001, count=50, key=b}}}
I would like to turn it into
{id=1001, a=20, b=30}
{id=1002, a=40, b=50}
Which Pig commands can I use to do this?
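It looks like you are pivoting, similar to Pivoting in Pig, except that you already have a bag of tuples. An inner join would be expensive because it triggers additional MapReduce jobs. To do this quickly, filter inside a nested foreach. The modified code would look something like this: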
inpt = load '..../pig/bag_pivot.txt' as (id : int, b:bag{tuple:(id : int, count : int, key : chararray)});
result = foreach inpt {
    col1 = filter b by key == 'a';
    col2 = filter b by key == 'b';
    generate id, flatten(col1.count) as a, flatten(col2.count) as b;
};
Sample input data:
1001 {(1001,20,a),(1001,30,b)}
1002 {(1002,40,a),(1001,50,b)}
Output:
(1001,20,30)
(1002,40,50)
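One caveat: flattening an empty bag in Pig produces no output row, so an id whose bag has no 'a' tuple or no 'b' tuple would be dropped by the script above. Below is a minimal, untested sketch of one way to keep such rows, assuming a placeholder count of 0 is acceptable. The 0 default and the aliases padded, a_counts, b_counts and result_padded are illustrative and not part of the original answer.

```pig
padded = foreach inpt {
    col1 = filter b by key == 'a';
    col2 = filter b by key == 'b';
    -- when a key is absent, substitute a one-tuple bag holding 0 (assumed default)
    generate id,
             (IsEmpty(col1) ? {(0)} : col1.count) as a_counts,
             (IsEmpty(col2) ? {(0)} : col2.count) as b_counts;
};
-- flatten in a second step so every id yields at least one row
result_padded = foreach padded generate id, flatten(a_counts) as a, flatten(b_counts) as b;
```

If a key can occur more than once for the same id, the two flattens produce the cross product of the matching counts, so this only yields one row per id when each key appears at most once.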
I'm not sure what the format of your starting relation is, but to me it looks like (int, bag:{tuple:(int,int,chararray)})? If so, this should work:
flattened = FOREACH x GENERATE $0 AS id, flatten($1) AS (idx:int, count:int, key:chararray);
a = FILTER flattened BY key == 'a';
b = FILTER flattened BY key == 'b';
joined = JOIN a BY id, b BY id;
result = FOREACH joined GENERATE a::id AS id, a::count AS a, b::count AS b;
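Note that the inner JOIN above drops any id that has only one of the two keys. If those ids should be kept, a full outer join is one option. The following is a sketch that reuses the a and b relations defined above; the aliases joined_outer and result_outer are made up for illustration.

```pig
-- keep ids present on either side; the missing side's count comes back as null
joined_outer = JOIN a BY id FULL OUTER, b BY id;
result_outer = FOREACH joined_outer GENERATE
    (a::id IS NOT NULL ? a::id : b::id) AS id,  -- take the id from whichever side matched
    a::count AS a,
    b::count AS b;
```

For the sample data this should give the same rows as the inner join, since both ids carry an 'a' and a 'b' tuple.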