json - Parsing complex nested JSON in Pig

Question

I want to parse the billionaires JSON dataset into Pig. The JSON file can be found here.

Here is what each entry looks like:

{
    "wealth": {
        "worth in billions": 1.2,
        "how": {
            "category": "Resource Related",
            "from emerging": true,
            "industry": "Mining and metals",
            "was political": false,
            "inherited": true,
            "was founder": true
        },
        "type": "privatized and resources"
    },
    "company": {
        "sector": "aluminum",
        "founded": 1993,
        "type": "privatization",
        "name": "Guangdong Dongyangguang Aluminum",
        "relationship": "owner"
    },
    "rank": 1372,
    "location": {
        "gdp": 0.0,
        "region": "East Asia",
        "citizenship": "China",
        "country code": "CHN"
    },
    "year": 2014,
    "demographics": {
        "gender": "male",
        "age": 50
    },
    "name": "Zhang Zhongneng"
}

Attempt 1

I tried loading this data in grunt with the following command:

billionaires = LOAD 'billionaires.json' USING JsonLoader('wealth: (worth in billions:double, how: (category:chararray, from emerging:chararray, industry:chararray, was political:chararray, inherited:chararray, was founder:chararray), type:chararray), company: (sector:chararray,founded:int,type:chararray,name:chararray,relationship:chararray),rank:int,location:(gdp:double,region:chararray,citizenship:chararray,country code:chararray),year:int,demographics:(gender:chararray,age:int),name:chararray');

However, this gives me the error:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: mismatched input 'in' expecting RIGHT_PAREN

Attempt 2

Next, I tried the loader from Twitter's elephant-bird project, com.twitter.elephantbird.pig.load.JsonLoader. Here is the code for this UDF. This is what I did:

billionaires = LOAD 'billionaires.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
names = foreach billionaires generate json#'name' AS name;
dump names;

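Now it runs and I get no errors! But nothing is displayed. I get output like this:

Input(s): Successfully read 0 records (1445335 bytes) from: "hdfs://localhost:9000/user/purak/billionaires.json"

Output(s): Successfully stored 0 records in: "hdfs://localhost:9000/tmp/temp-1399280624/tmp-477607570"

Counters: Total records written : 0 Total bytes written : 0 Spillable Memory Manager spill count : 0 Total bags proactively spilled: 0 Total records proactively spilled: 0

Job DAG: job_1478889184960_0005

What am I doing wrong here?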

score 0 · Accepted Answer

This may not be the best approach, but here is what I ended up doing:

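Remove spaces from field names: In the JSON dataset I replaced field names such as "worth in billions" and "from emerging" with "worth_in_billions", "from_emerging", and so on. (A simple find-and-replace was enough for this.)
Comma-separated JSON to newline-separated JSON: The JSON file I had was in the format [{"_comment":"first entry" ...},{"_comment":"second entry" ...}]. But JsonLoader in Pig treats each line as a new entry. To make the file newline-delimited instead of comma-delimited, I used jq, a command-line JSON processor. Install it with sudo apt-get install jq and run cat billionaires.json | jq -c ".[]" > newBillionaires.json.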
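The two preprocessing steps above could also be done in one pass with a short script instead of find-and-replace plus jq. What follows is only a minimal sketch in Python, not what the answer actually ran: it assumes the input is a single top-level JSON array, and the function names normalize_keys and array_to_ndjson are made up for illustration. Python's json module keeps keys in the order they appear in the input, so the schema passed to JsonLoader should list the fields in whatever order the output file actually has.

```python
import json


def normalize_keys(value):
    """Recursively replace spaces in dict keys with underscores."""
    if isinstance(value, dict):
        return {k.replace(" ", "_"): normalize_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize_keys(v) for v in value]
    return value


def array_to_ndjson(src_path, dst_path):
    """Read a top-level JSON array and write one compact object per line.

    Returns the number of records written.
    """
    with open(src_path, encoding="utf-8") as src:
        entries = json.load(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        for entry in entries:
            line = json.dumps(normalize_keys(entry), separators=(",", ":"))
            dst.write(line + "\n")
    return len(entries)


# Example call (paths are the ones used in this answer):
# array_to_ndjson("billionaires.json", "newBillionaires.json")
```

Doing the rename on parsed keys rather than on raw text avoids accidentally changing string values that happen to contain the same words.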
The newBillionaires.json file now has each entry on its own line. Now load this file into Pig using the following commands:

copyFromLocal /home/purak/Desktop/newBillionaires.json /user/purak

billionaires = LOAD 'newBillionaires.json' USING JsonLoader('name:chararray,demographics:(age:int,gender:chararray),year:int,location:(country_code:chararray,citizenship:chararray,region:chararray,gdp:double),rank:int,company:(relationship:chararray,name:chararray,type:chararray,founded:int,sector:chararray),wealth:(type:chararray,how:(was_founder:chararray,inherited:chararray,was_political:chararray,industry:chararray,from_emerging:chararray,category:chararray),worth_in_billions:double)');

Note: Using jq reversed the order of the fields in each entry. So in this load command, all fields are in reverse order compared to the load command in the question.
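Since the schema in the load command follows the order of the fields in the file, it can help to print the key order of the first record before writing the schema. The following is a small sketch in Python under the assumption that the file holds one JSON object per line; schema_order and first_record_order are hypothetical helper names, not part of Pig or jq.

```python
import json


def schema_order(value, indent=0):
    """Return a list of key names in file order, nested keys indented."""
    lines = []
    if isinstance(value, dict):
        for key, child in value.items():
            lines.append(" " * indent + key)
            lines.extend(schema_order(child, indent + 2))
    return lines


def first_record_order(path):
    """Parse the first line of a newline-delimited JSON file and list its keys."""
    with open(path, encoding="utf-8") as handle:
        record = json.loads(handle.readline())
    return schema_order(record)


# Example call:
# print("\n".join(first_record_order("newBillionaires.json")))
```

The printed outline can then be copied, top to bottom, into the JsonLoader schema string.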

You can now flatten each nested tuple using:

billionairesFinal = foreach billionaires generate name, demographics.age as age, demographics.gender as gender, year, location.country_code as countryCode, location.citizenship as citizenship, location.region as region, location.gdp as gdp, rank, company.relationship as companyRelationship, company.name as companyName, company.type as companyType, company.founded as companyFounded, company.sector as companySector, wealth.type as wealthType, wealth.how.was_founder as wasFounder, wealth.how.inherited as inherited, wealth.how.was_political as wasPolitical, wealth.how.industry as industry, wealth.how.from_emerging as fromEmerging, wealth.how.category as category, wealth.worth_in_billions as worthInBillions;

Check the structure using describe billionairesFinal;:

billionairesFinal: {name: chararray,age: int,gender: chararray,year: int,countryCode: chararray,citizenship: chararray,region: chararray,gdp: double,rank: int,companyRelationship: chararray,companyName: chararray,companyType: chararray,companyFounded: int,companySector: chararray,wealthType: chararray,wasFounder: chararray,inherited: chararray,wasPolitical: chararray,industry: chararray,fromEmerging: chararray,category: chararray,worthInBillions: double}

This is the data structure I wanted in Pig! Now I can go on to analyze the dataset :)
