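I have some JSON data that secor saved to S3 in SequenceFile format, and I want to analyze it with Pig. Using elephant-bird I managed to load it from S3 as a bytearray, but I can't convert it to a chararray, which is apparently required for parsing the JSON: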

%declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
%declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter';
%declare BYTES_CONVERTER 'com.twitter.elephantbird.pig.util.BytesWritableConverter';
%declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';

grunt> A = LOAD 's3n://...logs/raw_logs/...events/dt=2015-12-08/1_0_00000000000085594299'
       USING $SEQFILE_LOADER ('-c $LONG_CONVERTER', '-c $BYTES_CONVERTER')
       AS (key: long, value: bytearray);
grunt> B = LIMIT A 1;
grunt> DUMP B;

(85653965,{"key": "val1", other json data, ...})

grunt> DESCRIBE B;

B: {key: long,value: bytearray}

grunt> C = FOREACH B GENERATE (key, (chararray)value);
grunt> DUMP C;

2015-12-08 19:32:09,133 [main] ERROR org.apache.pig.tools.grunt.Grunt -
   ERROR 1075: Received a bytearray from the UDF or Union from two different Loaders.
   Cannot determine how to convert the bytearray to string.

Using the TextConverter instead of the BytesWritableConverter just leaves me with empty values, for example:

(85653965,)

Pig is clearly able to convert the bytearray to a string in order to dump it, so this doesn't seem impossible. How do I do it?
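
To make concrete what I mean by the conversion: outside Pig it is just a byte-to-text decode followed by a JSON parse. Here is a minimal Python sketch, assuming the value bytes are UTF-8-encoded JSON (the encoding is my assumption; `decode_json_value` and the sample payload are hypothetical names for illustration, not part of Pig or elephant-bird). It does not address the ERROR 1075 cast problem itself.

```python
import json


def decode_json_value(raw: bytes) -> dict:
    """Decode a raw byte payload as text, then parse it as JSON."""
    # Assumption: the payload is UTF-8 encoded; adjust if the writer used another encoding.
    text = raw.decode("utf-8")
    return json.loads(text)


# Hypothetical sample payload, shaped like the value shown in the DUMP output.
sample = b'{"key": "val1"}'
record = decode_json_value(sample)
print(record["key"])
```

What I am after is the equivalent of this inside a Pig script.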
