
I have three kinds of data:

1) base data 2) data_dict_1 3) data_dict_2

The base data is well-formed JSON, for example:

{"id1":"foo", "id2":"bar" ,type:"type1"}
{"id1":"foo", "id2":"bar" ,type:"type2"}

data_dict_1

1 foo
2 bar
3 foobar
....

data_dict_2

-1 foo
-2 bar
-3 foobar
... and so on

Now, what I want is this: if a record is type1, look up id1 in data_dict_1 and id2 in data_dict_2 and assign the corresponding integer id. If the record is type2, look up id1 in data_dict_2 and id2 in data_dict_1 and assign the corresponding ids. For example:

{"id1":1, "id2":2 ,type:"type1"}
{"id1":-1, "id2":-2 ,type:"type2"}

and so on. How can I do this in Pig?


1 Answer


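Note: the example above is not valid JSON, because the type key is not quoted.

Assuming Pig 0.10 or later, which ships with a built-in JsonLoader, you can pass it a schema and load the data: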

data = LOAD 'loljson' USING JsonLoader('id1:chararray,id2:chararray,type:chararray');

Then load the dictionaries:

dict_1 = LOAD 'data_dict_1' USING PigStorage(' ') AS (id:int, key:chararray);
dict_2 = LOAD 'data_dict_2' USING PigStorage(' ') AS (id:int, key:chararray);

Then split on the value of type:

SPLIT data INTO type1 IF type == 'type1', type2 IF type == 'type2';

JOIN each part with the appropriate dictionary:

type1_joined = JOIN type1 BY id1, dict_1 BY key;
type1_joined = FOREACH type1_joined GENERATE type1::id1 AS id1, type1::id2 AS id2, type1::type AS type, dict_1::id AS id;

type2_joined = JOIN type2 BY id2, dict_2 BY key;
type2_joined = FOREACH type2_joined GENERATE type2::id1 AS id1, type2::id2 AS id2, type2::type AS type, dict_2::id AS id;

Since the schemas are equal, UNION them together:

final_data = UNION type1_joined, type2_joined;

This yields:

DUMP final_data;

(foo,bar,type2,-2)
(foo,bar,type1,1)
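
The joins above attach one looked-up id to each record. If both id1 and id2 should be replaced by integers, as the question describes, a second join per type is one way to get there. The following is a minimal sketch, assuming the relations type1, type2, dict_1 and dict_2 defined above; every other alias and the output path 'mapped_json' are made up for illustration:

```pig
-- Sketch only: reuses type1, type2, dict_1 and dict_2 from the statements above.
-- All other alias names and the output path are hypothetical.

-- type1: id1 is looked up in dict_1, id2 in dict_2
t1_a = JOIN type1 BY id1, dict_1 BY key;
t1_b = FOREACH t1_a GENERATE dict_1::id AS id1, type1::id2 AS id2_key, type1::type AS type;
t1_c = JOIN t1_b BY id2_key, dict_2 BY key;
type1_mapped = FOREACH t1_c GENERATE t1_b::id1 AS id1, dict_2::id AS id2, t1_b::type AS type;

-- type2: id1 is looked up in dict_2, id2 in dict_1
t2_a = JOIN type2 BY id1, dict_2 BY key;
t2_b = FOREACH t2_a GENERATE dict_2::id AS id1, type2::id2 AS id2_key, type2::type AS type;
t2_c = JOIN t2_b BY id2_key, dict_1 BY key;
type2_mapped = FOREACH t2_c GENERATE t2_b::id1 AS id1, dict_1::id AS id2, t2_b::type AS type;

-- Both relations now share the schema (id1:int, id2:int, type:chararray)
mapped = UNION type1_mapped, type2_mapped;

-- Write the result back out as JSON with the built-in JsonStorage
STORE mapped INTO 'mapped_json' USING JsonStorage();
```

These are inner joins, so a record whose id1 or id2 has no entry in the corresponding dictionary is dropped from the result. If such records need to be kept, a LEFT OUTER join would probably be the place to start.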
answered 2013-10-22T20:52:11.177