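I'm trying to use Open Refine to analyze a messy dataset of JSON strings (40k rows), but because JSON objects are unordered, some of the rows got jumbled when they were returned and written to the file.
Some objects are missing keys, and some have their keys in a different order. Example: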
1 {"about":"foo", "category":"bar", "id":"123", "cat_list": ["category1":"foo2"]}
2 {"id":"22","about":"barFoo", "category":"NotABar"}
3 {"about":"barbar", "category":"website", "id":"3333", "cat_list": ["category1":"foo22"]}
....
....
....
40,000 {"about":"bar123", "category":"publish", "id":"3323", "cat_list": ""}
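The problem:
After importing the data into Open Refine, the program asks for a specific schema to compare against while reading the file. It then reads the file, compares the JSON object on each row against that schema, and imports or discards the row depending on how well it matches! As a result, a lot of entries were left out!
Ideally:
Using Python, I'd like to reorder the JSON objects to follow a specific schema that I specify.
Example:
Specified schema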
{"about":"", "category":"", "id":"", "cat_list": ""}
and then rearrange every row of JSON, keys and values, into that specific format:
1 {"about": ....
2 {"about": ....
3 {"about": ....
....
....
....
40,000 {"about": ....
I'm not entirely sure how to do this efficiently?
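To make the goal concrete, here is a minimal sketch of the kind of reordering I mean. It assumes one valid JSON object per line with the leading row numbers already stripped; `SCHEMA`, `reorder`, and `reorder_lines` are names made up for illustration. Missing keys are filled with an empty string and keys outside the schema are dropped. A row that is not valid JSON (for instance a list containing a `"key":"value"` pair) would raise `json.JSONDecodeError` here.

```python
import json

# Hypothetical target key order, taken from the schema shown above.
SCHEMA = ["about", "category", "id", "cat_list"]

def reorder(line, schema=SCHEMA, default=""):
    """Parse one JSON object and re-serialize it with keys in schema order."""
    record = json.loads(line)
    # dicts keep insertion order (Python 3.7+), so building the dict
    # in schema order fixes the key order of the output.
    ordered = {key: record.get(key, default) for key in schema}
    return json.dumps(ordered)

def reorder_lines(lines):
    """Yield a reordered JSON string for every non-blank input line."""
    for line in lines:
        line = line.strip()
        if line:
            yield reorder(line)
```

Since `reorder_lines` is a generator, it can be fed an open file object directly, so the 40k rows never need to be held in memory at once.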
Edit:
I decided to just write a script to organize this. I removed some of the complex fields and now have a complete .JSON file:
{"name":"Carstar Bridgewater",
"category":"Automotive",
"about":"We are Bridgewaters largest professional collision centre and are committed to being there for customer cars and communities when they need us.",
"country":"Canada",
"state":"NS",
"city":"Bridgewater
"},
{"name":"Febreze",
"category":"Product/Service
",
"about":"Freshness that eliminates odorsso you can breathe happy.",
"country":"Added Nothing",
"state":"Added Nothing",
"city":"Added Nothing"},
{"name":"Custom Wood & Acrylic Turnings",
"category":"Professional Services",
"about":"Hand crafted item turned on a wood lath pen pencil bottle stopper cork screw bottle opener perfume applicator or other custom turnings",
"country":"Canada",
"state":"NS
",
"city":"Middle Sackville"},
{"name":"The Hunger Games",
"category":"Movie
",
"about":"THE HUNGER GAMES: MOCKINGJAY - PART 1 - In theatres November 2 2014. www.hungergamesmovie.ca",
"country":"Added Nothing",
"state":"Added Nothing",
"city":"Added Nothing"},
However, Google-Refine still refuses to accept my file? What am I doing wrong?
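One way to narrow this down is to check whether the file parses as JSON at all, independent of Refine. The sketch below uses Python's standard `json` module; `first_json_error` is a hypothetical helper name. Python's parser rejects raw line breaks inside string values, and it also rejects several top-level objects separated by commas without an enclosing `[ ]`, so either of those might be what the importer is tripping over.

```python
import json

def first_json_error(text):
    """Return None if text parses as JSON, else where the first error occurs."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # lineno and colno point at the position where parsing stopped.
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None
```

Calling it on the whole file contents, for example `first_json_error(open(path, encoding="utf-8").read())` with `path` pointing at the file, reports the first position the parser could not get past.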