hbase - How to join maps in Apache Pig? (stored in HBase)

Question

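I have a question about Apache Pig and don't know how to solve it, or whether it is even possible. I use HBase as the "storage layer". The table looks like this: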

row key/column  (b1, c1)        (b2, c2)    ...     (bn, cn)
a1              empty           empty               empty   
a2              ...
an              ...

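There are row keys a1 to an, and each row has a different set of columns, named with the syntax (bn, cn). The value of every row/column cell is empty.

My Pig program looks like this: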

/* Loading the data */
mydata = load 'hbase://mytable' ... as (a:chararray, b_c:map[]);

/* finding the right elements */ 
sub1 = FILTER mydata BY a == 'a1';
sub2 = FILTER mydata BY a == 'a2';

Now I want to join sub1 and sub2, meaning I want to find the columns that exist in both sub1 and sub2. How can I do this?

score 0 · Accepted Answer

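Maps can't do this kind of operation in pure Pig, so you will need a UDF. I'm not sure what you want as the output of the join, but it should be fairly easy to adapt the Python UDF below to your needs.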

myudf.py

@outputSchema('cols: {(col:chararray)}')
def join_maps(M1, M2):
    # Returns every column name that exists in both maps with a non-null value.
    out = []
    for k, v in M1.iteritems():  # Jython is Python 2, hence iteritems()
        if k in M2 and v is not None and M2[k] is not None:
            # The declared schema is a bag of 1-field tuples, so wrap the key.
            out.append((k,))
    return out

You can use it like this:

register 'myudf.py' using jython as myudf ;

-- sub2 can be referenced as a scalar inside the FOREACH over sub1 because it has only one row
D = FOREACH sub1 GENERATE myudf.join_maps(b_c, sub2.b_c) ;
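
To check the intersection logic outside of Pig, here is a minimal standalone sketch in plain Python 3. It assumes each HBase row arrives as a dictionary from column name to value; the function name common_columns and the sample column names such as "b1:c1" are hypothetical and only serve the illustration.

```python
def common_columns(m1, m2):
    """Return the sorted column names present in both maps with non-None values."""
    # A missing or empty map cannot share any column with the other one.
    if not m1 or not m2:
        return []
    return sorted(
        k for k, v in m1.items()
        if v is not None and m2.get(k) is not None
    )


# Hypothetical rows: empty strings stand in for the empty HBase cell values.
row_a1 = {"b1:c1": "", "b2:c2": "", "b3:c3": ""}
row_a2 = {"b2:c2": "", "b3:c3": "", "b4:c4": ""}

print(common_columns(row_a1, row_a2))
```

Note that an empty cell value is still a value here: only columns that are absent or explicitly None are excluded, which mirrors the None checks in the UDF above.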
