For some reason I ended up with a filename.log that looks like this (tab-separated):
Name:Peter Age:18
Name:Tom Age:25
Name:Jason Age:35
Because the keys in each column can differ, I can't define a schema when I load the text, like this:
a = load 'filename.log' as (Name:chararray,Age:int);
I also don't want to refer to columns by position:
b = foreach a generate $0,$1;
What I want is to start from just that filename.log and refer to each value by its key, for example:
a = load 'filename.log' using PigStorage('\t');
b = group a by Name;
c = foreach b generate group, COUNT(b);
dump c;
To do this, I wrote a Java UDF that splits each key:value pair and takes the value for every field in the tuple, as follows:
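import java.util.ArrayList;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class SPLITALLGETCOL2 extends EvalFunc<Tuple> {
    @Override
    public Tuple exec(Tuple input) {
        TupleFactory mTupleFactory = TupleFactory.getInstance();
        ArrayList<String> mProtoTuple = new ArrayList<String>();
        Tuple output;
        // Strip the surrounding parentheses from the tuple's string form, e.g. "(Name:Peter,Age:18)"
        String target = input.toString().substring(1, input.toString().length() - 1);
        String[] tokenized = target.split(",");
        try {
            for (int i = 0; i < tokenized.length; i++) {
                // Keep only the value part of each key:value pair; the key is discarded
                mProtoTuple.add(tokenized[i].split(":")[1]);
            }
            output = mTupleFactory.newTupleNoCopy(mProtoTuple);
            return output;
        } catch (Exception e) {
            // On a malformed field, return whatever values were collected so far
            output = mTupleFactory.newTupleNoCopy(mProtoTuple);
            return output;
        }
    }
}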
How should I change this method to get what I want? Or what other UDF should I write to get there?
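One direction that might work, as a rough sketch only: instead of throwing the keys away, keep each line as a map from key to value, so that a value can be looked up by name rather than by position. The class name `KeyValueLine` and the method `parse` below are hypothetical names made up for illustration, and the code is plain Java with no Pig dependency. It assumes every field has the form key:value and that fields are separated by tabs, as in the sample above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Pig): turns one log line into key -> value pairs.
public class KeyValueLine {

    // Parse a tab-separated line such as "Name:Peter\tAge:18".
    // Fields without a key before the first ':' are skipped.
    public static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String token : line.split("\t")) {
            int sep = token.indexOf(':');
            if (sep <= 0) {
                continue; // no key present, ignore this field
            }
            // Split on the first ':' only, so values may themselves contain ':'
            fields.put(token.substring(0, sep), token.substring(sep + 1));
        }
        return fields;
    }
}
```

With this, `KeyValueLine.parse("Name:Peter\tAge:18").get("Name")` yields `Peter`. If the same logic were wrapped in an `EvalFunc` that returns a map, Pig's map dereference syntax (for example `m#'Name'`) could presumably be used to pick fields by key; that wiring is an assumption and is not shown here.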