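It sounds like you basically just want to do a join (it isn't clear from the question whether this should be INNER, LEFT, RIGHT, or FULL). I think @SNeumann has essentially written the answer already, but I'll add some code to make it clearer.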
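For reference, here is a minimal sketch of how the four join types are spelled in Pig Latin. The relation names, file names, and fields (left_rel, right_rel, 'left_data', 'right_data', key, left_val, right_val) are hypothetical placeholders and are not taken from the question:

```pig
-- Hypothetical inputs, for illustration only.
left_rel  = LOAD 'left_data'  AS (key:chararray, left_val:chararray);
right_rel = LOAD 'right_data' AS (key:chararray, right_val:chararray);

-- Inner join (the default): only keys present on both sides survive.
inner_joined = JOIN left_rel BY key, right_rel BY key;

-- Outer joins: unmatched rows are kept and padded with nulls.
left_joined  = JOIN left_rel BY key LEFT OUTER,  right_rel BY key;
right_joined = JOIN left_rel BY key RIGHT OUTER, right_rel BY key;
full_joined  = JOIN left_rel BY key FULL OUTER,  right_rel BY key;
```

The code further down uses the default inner join; swapping in one of the outer forms should only require changing the JOIN line.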
Assuming the data looks like this:
data1 = 'item1' 111 { ('thing1', 222, {('value1'),('value2')}) }
...
data2 = 'value1' 'result1'
'value2' 'result2'
...
I would do something like this (untested):
A = load 'data1' as ( item:chararray, d:int, things:bag{(thing:chararray, d1:int, values:bag{(v:chararray)})} );
B = load 'data2' as ( v:chararray, r:chararray );
A_flattened = FOREACH A GENERATE item, d, things.thing AS thing, things.d1 AS d1, FLATTEN(things.values) AS value;
--This looks like:
--'item1', 111, 'thing1', 222, 'value1'
--'item1', 111, 'thing1', 222, 'value2'
A_B_joined = JOIN A_flattened BY value, B BY v;
--This looks like:
--'item1', 111, 'thing1', 222, 'value1', 'value1', 'result1'
--'item1', 111, 'thing1', 222, 'value2', 'value2', 'result2'
A_B_joined1 = FOREACH A_B_joined GENERATE item, d, thing, d1, A_flattened::value AS value, r AS result;
A_B_grouped = GROUP A_B_joined1 BY (value, result);
From there, re-bagging however you like should be trivial.
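One possible way to re-bag, assuming the goal is one row per (item, d, thing, d1) carrying a bag of (value, result) pairs. That target shape is my assumption about the desired output, the aliases A_B_regrouped and A_B_rebagged are made up for illustration, and like the rest of this it is untested:

```pig
-- Continues from A_B_joined1 above.
-- Group the joined rows back by the fields that identify one "thing".
A_B_regrouped = GROUP A_B_joined1 BY (item, d, thing, d1);

-- Unpack the group key and keep only (value, result) inside the bag.
A_B_rebagged = FOREACH A_B_regrouped GENERATE
    FLATTEN(group) AS (item, d, thing, d1),
    A_B_joined1.(value, result) AS results;
```

Grouping on different fields gives a different nesting, so pick the key that matches the structure you want back.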
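EDIT: The above should use '.' as the projection operator on a tuple. I've swapped it in. It also assumes things is one big tuple, which it isn't. It's a bag of one item. If the OP never intends to have more than one item in that bag, I'd strongly suggest using a tuple instead and loading as: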
A = load 'data1' as (item:chararray, d:int, things:(thing:chararray, d1:int, values:bag{(v:chararray)}));
Then use the rest of the code essentially as is (note: still untested).
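If a bag is absolutely required, then the whole problem changes, and it becomes unclear what the OP wants to happen when there are multiple objects in the bag. Bag projection is also fairly complicated, as noted here.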
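If the bag can hold several items and one output row per (thing, value) pair is acceptable, which is only my guess at what the OP wants, one option might be to flatten in two steps instead of projecting out of the bag. This is an untested sketch, and A_things is a made-up alias:

```pig
A = LOAD 'data1' AS (item:chararray, d:int, things:bag{(thing:chararray, d1:int, values:bag{(v:chararray)})});

-- First flatten: one row per tuple in the things bag.
A_things = FOREACH A GENERATE item, d, FLATTEN(things) AS (thing, d1, values);

-- Second flatten: one row per entry in each values bag.
A_flattened = FOREACH A_things GENERATE item, d, thing, d1, FLATTEN(values) AS value;
```

The resulting A_flattened has the same columns as before, so the join and later steps should carry over. Note that FLATTEN drops rows whose bag is empty.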