json - How do I parse JSON in Pig?

Question

I have a lot of gzipped log files in S3 that contain three types of log lines: b, c, and i. Types i and c are both single-level JSON:

{"this":"that","test":"4"}

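Type b is deeply nested JSON. I came across this gist, which talks about compiling a jar to get this working. Since my Java skills are less than stellar, I don't really know what to do from here.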

{"this":{"foo":"bar","baz":{"test":"me"},"total":"5"}}

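Since the keys in types i and c don't always appear in the same order, it is difficult to specify everything in a GENERATE regex. Is it possible to process JSON (inside gzipped files) with Pig? I'm using whatever version of Pig comes built into an Amazon Elastic MapReduce instance.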

This boils down to two questions: 1) Can I parse JSON with Pig (and if so, how)? 2) If I can parse JSON (from gzipped log files), can I also parse nested JSON objects?

score 17 · Accepted Answer

Pig 0.10 comes with built-in JsonStorage and JsonLoader().

See the Pig documentation on JSON loading/storing.

answered 2012-07-03T22:00:44.443
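
As a rough illustration of the built-in loader, the sketch below reads the flat lines from the question and writes one field back out. The paths and alias names are made up, and the schema string is an assumption based on the sample record. As far as I know, older releases of the built-in JsonLoader read fields in schema order rather than by key name, so records whose keys arrive in a different order may not load as expected; that is worth checking against your Pig version.

```pig
-- Sketch only: assumes Pig 0.10 or later; both paths are hypothetical.
-- Each input line is expected to look like {"this":"that","test":"4"}.
c_lines = LOAD 'input/type_c.json'
          USING JsonLoader('this:chararray, test:chararray');

-- Keep only the field of interest.
tests = FOREACH c_lines GENERATE test;

-- JsonStorage writes one JSON object per record.
STORE tests INTO 'output/tests' USING JsonStorage();
```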

score 8 · Accepted Answer

After a lot of workarounds and troubleshooting, I was able to get this working. I wrote a post on my blog about how to do it. It can be found here: http://eric.lubow.org/2011/hadoop/pig-queries-parsing-json-on-amazons-elastic-map-reduce-using-s3-data/

score 5 · Accepted Answer

Pig comes with a JSON loader. To load your data, use:

A = LOAD 'data.json' USING JsonLoader();

To store, you can use:

STORE A INTO 'output.json'
    USING JsonStorage();

However, I'm not sure whether it supports gzipped data...
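
On the gzip question: the Pig documentation states that PigStorage and TextLoader read gzip-compressed input as long as the files carry a .gz extension. Whether a given JSON loader behaves the same way depends on the input format it uses, so a quick check like the one below may help before building a full script. The bucket name and path are hypothetical.

```pig
-- Sketch only: the bucket and path are hypothetical.
-- The .gz suffix is what tells Hadoop to decompress the input.
raw_lines = LOAD 's3://my-bucket/logs/*.gz'
            USING TextLoader() AS (line:chararray);

-- Print a few decompressed lines to confirm they are readable JSON text.
sample_lines = LIMIT raw_lines 5;
DUMP sample_lines;
```

If the dumped lines look like plain JSON, swapping TextLoader for a JSON loader on the same path is the next thing to try.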

score 3 · Accepted Answer

Please try this: https://github.com/ab/elephant-bird

answered 2011-12-27T19:45:51.643

score 2 · Accepted Answer

We can do this with JsonLoader, but we have to specify the schema of the JSON data, otherwise it may raise an error. See the following link:

         http://joshualande.com/read-write-json-apache-pig/

We can also parse it by writing a UDF.
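
To make the schema requirement concrete for the nested case, here is a minimal sketch that describes the type b record from the question as nested tuples. It assumes the built-in JsonLoader from Pig 0.10 or later; the path and alias names are hypothetical, and the schema is inferred from the single sample line, so real data may need more fields.

```pig
-- Sketch only: assumes Pig 0.10 or later and a hypothetical input path.
-- Each line is expected to look like:
--   {"this":{"foo":"bar","baz":{"test":"me"},"total":"5"}}
b_lines = LOAD 'input/type_b.json'
          USING JsonLoader('this:(foo:chararray, baz:(test:chararray), total:chararray)');

-- Nested tuples are dereferenced with dots.
picked = FOREACH b_lines GENERATE
    this.foo      AS foo,
    this.baz.test AS test,
    this.total    AS total;

DUMP picked;
```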

score 0 · Accepted Answer

I've seen usage of Twitter's elephant-bird grow a lot; it is quickly becoming the go-to library for JSON parsing in Pig.

Example:

DEFINE TwitterJsonLoader com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true ');

JsonInput = LOAD 'input_path' USING TwitterJsonLoader() AS (entity: map[]);

InputObjects = FOREACH JsonInput GENERATE (map[]) entity#'Object' AS JsonObject;

InputIds = FOREACH InputObjects GENERATE JsonObject#'id' AS id;
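
Applied to the nested type b line from the question, the same pattern might look like the sketch below. The elephant-bird jars have to be registered first; the jar names and paths here are assumptions (they vary by elephant-bird version and build), as are the input path and alias names. Because each record comes back as a map, values are looked up by key, so the order of keys in the input should not matter.

```pig
-- Sketch only: jar names and paths are hypothetical; use the ones from your build.
REGISTER 'lib/elephant-bird-core.jar';
REGISTER 'lib/elephant-bird-pig.jar';
REGISTER 'lib/elephant-bird-hadoop-compat.jar';
REGISTER 'lib/json-simple.jar';

-- Each line is expected to look like:
--   {"this":{"foo":"bar","baz":{"test":"me"},"total":"5"}}
b_lines = LOAD 'input/type_b.json'
          USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
          AS (json:map[]);

-- Cast each nested object to a map before looking up its keys.
this_maps = FOREACH b_lines GENERATE (map[]) json#'this' AS this_map;
baz_maps  = FOREACH this_maps GENERATE this_map#'foo' AS foo,
                                       (map[]) this_map#'baz' AS baz;
picked    = FOREACH baz_maps GENERATE foo, baz#'test' AS test;

DUMP picked;
```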

score 0 · Accepted Answer

You can try Twitter's elephant-bird JSON loader, which handles JSON data dynamically. But you have to be very precise with the schema.

api_data = LOAD 'filename' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
