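I am trying to analyze a messy dataset of JSON strings (40k rows) with OpenRefine, but because JSON objects are unordered, some of the rows get jumbled when they are returned and logged to the file.

Some objects are missing keys, and some have their keys in a different order. Example: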

1   {"about":"foo", "category":"bar", "id":"123", "cat_list": ["category1":"foo2"]}
2   {"id":"22","about":"barFoo", "category":"NotABar"}
3   {"about":"barbar", "category":"website", "id":"3333", "cat_list": ["category1":"foo22"]}
....
....
....
40,000 {"about":"bar123", "category":"publish", "id":"3323", "cat_list": ""}

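Problem:

After importing the data into OpenRefine, the program asks for a specific schema to compare against while it reads the file. It then reads the supplied file, compares the JSON object on each line with that schema, and imports or discards the object depending on how closely it matches. As a result, many entries are dropped!

Ideally:

Using Python, I would like to reorder the JSON objects to match a specific schema that I specify.

Example:

Specified schema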

{"about":"", "category":"", "id":"", "cat_list": ""}

Then rearrange every line of JSON, with its keys and values, into that specific format:

1   {"about": ....
2   {"about": ....
3   {"about": ....
....
....
....
40,000 {"about": ....

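I am not entirely sure how to do this efficiently.

Edit:

I decided to just write a script to organize this. I removed some of the complex fields and now have a complete .json file: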

{"name":"Carstar Bridgewater", 
"category":"Automotive", 
"about":"We are Bridgewaters largest professional collision centre and are committed to being there for customer cars and communities when they need us.", 
"country":"Canada", 
"state":"NS", 
"city":"Bridgewater
"}, 
{"name":"Febreze", 
"category":"Product/Service
", 
"about":"Freshness that eliminates odorsso you can breathe happy.", 
"country":"Added Nothing", 
"state":"Added Nothing", 
"city":"Added Nothing"},
{"name":"Custom Wood & Acrylic Turnings", 
"category":"Professional Services", 
"about":"Hand crafted item turned on a wood lath pen pencil bottle stopper cork screw bottle opener perfume applicator or other custom turnings", 
"country":"Canada", 
"state":"NS
", 
"city":"Middle Sackville"},
{"name":"The Hunger Games", 
"category":"Movie
", 
"about":"THE HUNGER GAMES: MOCKINGJAY - PART 1 - In theatres November 2 2014. www.hungergamesmovie.ca", 
"country":"Added Nothing", 
"state":"Added Nothing", 
"city":"Added Nothing"},

However, Google Refine still refuses to accept my file. What am I doing wrong?

2 Answers

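Not sure whether you have solved this yet.

The JSON has to be valid to import successfully. As it stands, the text you posted in the question above does not validate with tools such as http://jsonlint.com.

The problem you are hitting when importing it into OpenRefine (also known as Google Refine) is that the JSON objects must be inside an array: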
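The same validity check can be run locally. Below is a minimal sketch that uses only Python's standard `json` module; the helper name `check_json` and the two sample strings are made up for illustration, with the broken one imitating the raw line break inside the "city" value in the question.

```python
import json

def check_json(text):
    """Return None if text parses as JSON, else a short error description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None

# A raw line break inside a string value, as in the question's "city" field.
broken = '{"city":"Bridgewater\n"}'
# The same object with the stray line break removed.
fixed = '{"city":"Bridgewater"}'

print(check_json(broken))  # a description of the first parse error
print(check_json(fixed))   # None, because the text is valid JSON
```

The reported line and column point at the first place the parser gave up, which is usually enough to find stray line breaks or a missing enclosing array.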

[{"name":"Carstar Bridgewater", 
"category":"Automotive", 
"about":"We are Bridgewaters largest professional collision centre and are committed to being there for customer cars and communities when they need us.", 
"country":"Canada", 
"state":"NS", 
"city":"Bridgewater"},
{"name":"Febreze", 
"category":"Product/Service", 
"about":"Freshness that eliminates odorsso you can breathe happy.", 
"country":"Added Nothing", 
"state":"Added Nothing", 
"city":"Added Nothing"},
{"name":"Custom Wood & Acrylic Turnings", 
"category":"Professional Services", 
"about":"Hand crafted item turned on a wood lath pen pencil bottle stopper cork screw bottle opener perfume applicator or other custom turnings", 
"country":"Canada", 
"state":"NS", 
"city":"Middle Sackville"},
{"name":"The Hunger Games", 
"category":"Movie", 
"about":"THE HUNGER GAMES: MOCKINGJAY - PART 1 - In theatres November 2 2014. www.hungergamesmovie.ca", 
"country":"Added Nothing", 
"state":"Added Nothing", 
"city":"Added Nothing"}]

I was able to import the JSON posted here into OpenRefine, and it works fine.

[Two screenshots of the successful import]
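If fixing the file by hand is impractical, the repair could be scripted. The following is only a sketch under assumptions: it supposes the file holds comma-separated objects with a trailing comma and stray line breaks inside string values, as in the question's edit, and the function name `wrap_objects_in_array` is hypothetical.

```python
import json

def wrap_objects_in_array(text):
    """Parse comma-separated JSON objects as one array and trim string values."""
    # Drop surrounding whitespace and the trailing comma after the last object.
    body = text.strip().rstrip(",")
    # strict=False lets the parser accept raw line breaks inside string values.
    records = json.loads("[" + body + "]", strict=False)
    # Trim stray whitespace (including those line breaks) from every string.
    return [
        {key: value.strip() if isinstance(value, str) else value
         for key, value in record.items()}
        for record in records
    ]

# Two shortened records in the same shape as the question's edited file.
raw = (
    '{"name":"Febreze", "category":"Product/Service\n"},\n'
    '{"name":"The Hunger Games", "category":"Movie\n"},'
)
cleaned = wrap_objects_in_array(raw)
print(json.dumps(cleaned, indent=2))
```

Writing `cleaned` back out with `json.dump` should give a single valid array that the JSON importer can read. The sketch does not handle input that is already wrapped in brackets.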

Answered 2016-07-07T14:12:53.070

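"After importing the data into Open Refine, the program asks for a specific schema to compare against while it reads the file."

That sounds as if OpenRefine is mistakenly detecting the file as XML rather than as JSON, or even as line-based text.

However, you can choose which importer to use (for example, line-based or JSON) instead of relying on the importer that OpenRefine picks automatically, since its guess is sometimes wrong.

It looks to me as though you may be dealing with the emerging "JSON Lines" or "newline-delimited JSON" format, documented here: http://jsonlines.org/

We have an open issue to eventually add JSON Lines support to OpenRefine: https://github.com/OpenRefine/OpenRefine/issues/1135

In the meantime, take a look at the On the Web section of the jsonlines.org site for tools that may help with what you need.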
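Until then, one possible workaround is to convert the line-delimited file into a single JSON array in Python, fixing the key order along the way. This is a rough sketch, not a tested solution for the full 40k-row file: the key list comes from the schema in the question, while the function name `normalize_json_lines`, the `""` default for missing keys, and the sample rows are assumptions. Keys that are not in the schema are dropped, and a line that is not valid JSON (such as the `["category1":"foo2"]` rows shown in the question) would raise `json.JSONDecodeError`.

```python
import json

# Key order taken from the schema given in the question.
SCHEMA_KEYS = ["about", "category", "id", "cat_list"]

def normalize_json_lines(lines, keys=SCHEMA_KEYS, default=""):
    """Parse one JSON object per line and return dicts with a fixed key order."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        obj = json.loads(line)
        # Build a new dict in schema order; missing keys get the default value.
        records.append({key: obj.get(key, default) for key in keys})
    return records

# Sample rows in the shape of the question's data (one object per line).
sample = [
    '{"id":"22","about":"barFoo", "category":"NotABar"}',
    '{"about":"bar123", "category":"publish", "id":"3323", "cat_list": ""}',
]
records = normalize_json_lines(sample)
print(json.dumps(records, indent=2))
```

For a real file, the `lines` argument can be the open file object itself, and the result can be written with `json.dump` to produce an array for the JSON importer.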

Answered 2016-06-15T19:03:25.203