python - Strictly using JSON, how to reorder key:values into a specific JSON schema for Open Refine

Question

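I'm trying to analyze a messy dataset of JSON strings (40k rows) with Open Refine, but because JSON objects are unordered, some rows of JSON objects get mixed up when they are returned and written to the file.

Some objects are missing keys, and some have their keys in a different order. Example: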

1   {"about":"foo", "category":"bar", "id":"123", "cat_list": ["category1":"foo2"]}
2   {"id":"22","about":"barFoo", "category":"NotABar"}
3   {"about":"barbar", "category":"website", "id":"3333", "cat_list": ["category1":"foo22"]}
....
....
....
40,000 {"about":"bar123", "category":"publish", "id":"3323", "cat_list": ""}

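Problem:

After importing the data into Open Refine, the program asks for a specific schema to compare against as it reads the file. It then reads the provided file, compares the JSON object on each line to the schema, and imports or discards it depending on how well it matches. As a result, a lot of entries are left out!

Ideally:

Using Python, I would like to reorder the JSON objects into a specific schema that I specify.

Example:

Specified schema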

{"about":"", "category":"", "id":"", "cat_list": ""}

and then rearrange each line of JSON, with its keys and values, into this specific format:

1   {"about": ....
2   {"about": ....
3   {"about": ....
....
....
....
40,000 {"about": ....

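I'm not entirely sure how to do this efficiently?

Edit:

I decided to just write a script to organize this. I removed some of the complex fields and now have a complete .JSON file: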

{"name":"Carstar Bridgewater", 
"category":"Automotive", 
"about":"We are Bridgewaters largest professional collision centre and are committed to being there for customer cars and communities when they need us.", 
"country":"Canada", 
"state":"NS", 
"city":"Bridgewater
"}, 
{"name":"Febreze", 
"category":"Product/Service
", 
"about":"Freshness that eliminates odorsso you can breathe happy.", 
"country":"Added Nothing", 
"state":"Added Nothing", 
"city":"Added Nothing"},
{"name":"Custom Wood & Acrylic Turnings", 
"category":"Professional Services", 
"about":"Hand crafted item turned on a wood lath pen pencil bottle stopper cork screw bottle opener perfume applicator or other custom turnings", 
"country":"Canada", 
"state":"NS
", 
"city":"Middle Sackville"},
{"name":"The Hunger Games", 
"category":"Movie
", 
"about":"THE HUNGER GAMES: MOCKINGJAY - PART 1 - In theatres November 2 2014. www.hungergamesmovie.ca", 
"country":"Added Nothing", 
"state":"Added Nothing", 
"city":"Added Nothing"},

However, Google-Refine still refuses to accept my file? What am I doing wrong?

score 0 · Accepted Answer

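Not sure whether you have solved this already.

The JSON needs to be valid to import successfully - currently the text you posted in the Q above does not validate with a tool such as http://jsonlint.com.

The problem you will have importing it into OpenRefine (aka Google Refine) is that the JSON objects have to be in an array: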

[{"name":"Carstar Bridgewater", 
"category":"Automotive", 
"about":"We are Bridgewaters largest professional collision centre and are committed to being there for customer cars and communities when they need us.", 
"country":"Canada", 
"state":"NS", 
"city":"Bridgewater"},
{"name":"Febreze", 
"category":"Product/Service", 
"about":"Freshness that eliminates odorsso you can breathe happy.", 
"country":"Added Nothing", 
"state":"Added Nothing", 
"city":"Added Nothing"},
{"name":"Custom Wood & Acrylic Turnings", 
"category":"Professional Services", 
"about":"Hand crafted item turned on a wood lath pen pencil bottle stopper cork screw bottle opener perfume applicator or other custom turnings", 
"country":"Canada", 
"state":"NS", 
"city":"Middle Sackville"},
{"name":"The Hunger Games", 
"category":"Movie", 
"about":"THE HUNGER GAMES: MOCKINGJAY - PART 1 - In theatres November 2 2014. www.hungergamesmovie.ca", 
"country":"Added Nothing", 
"state":"Added Nothing", 
"city":"Added Nothing"}]

I could successfully import the JSON posted here into OpenRefine and it works fine.
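If you want to script that fix rather than edit the file by hand, here is a minimal Python sketch. It assumes the file looks like the one in the question: objects separated by commas, no enclosing brackets, and stray line breaks inside some string values. The function name `wrap_objects_as_array` is made up for illustration, and `strict=False` is only there so that `json.loads` tolerates the raw newlines inside strings.

```python
import json


def wrap_objects_as_array(raw_text):
    """Parse comma-separated JSON objects as one array and tidy the values."""
    # Drop surrounding whitespace and the trailing comma after the last object.
    body = raw_text.strip().rstrip(",")
    # strict=False allows control characters (such as raw newlines) in strings.
    records = json.loads("[" + body + "]", strict=False)
    cleaned = []
    for record in records:
        # Strip the stray whitespace/newlines left inside string values.
        cleaned.append(
            {key: value.strip() if isinstance(value, str) else value
             for key, value in record.items()}
        )
    return cleaned


# Small sample in the same shape as the file from the question.
sample = (
    '{"name":"Febreze", \n"category":"Product/Service\n", \n'
    '"city":"Added Nothing"},\n'
    '{"name":"The Hunger Games", \n"category":"Movie\n", \n'
    '"city":"Added Nothing"},'
)

records = wrap_objects_as_array(sample)
# json.dumps writes a valid JSON array that can be saved and imported.
print(json.dumps(records, indent=2))
```

The printed text is a single JSON array, which is the shape shown above. If the real file has other problems (unescaped quotes, for example), `json.loads` will raise an error pointing at the offending position.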

score 0 · Accepted Answer

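"After importing the data into Open Refine, the program asks for a specific schema to compare against as it reads the file."

That sounds like it is accidentally detecting the file as XML rather than JSON or even Line-based.

However, you can choose which importer to use (e.g. Line-based or JSON) instead of relying on the automatically selected importer, which OpenRefine tries to guess and sometimes gets wrong.

It looks to me like you might be dealing with the newer "JSON Lines" or "newline-delimited JSON" format, as documented here: http://jsonlines.org/

We have an open issue to eventually add JSON Lines support to OpenRefine: https://github.com/OpenRefine/OpenRefine/issues/1135

In the meantime, take a look at the "On the Web" section of the jsonlines.org site for tool support that may help with your needs.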
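Until that support exists, a short Python script can turn one-object-per-line input into a regular JSON array with a fixed key order. This is only a sketch: `SCHEMA_KEYS`, `normalize_line` and `jsonlines_to_array` are names I made up, the key list is taken from the schema in the question, and missing keys are filled with an empty string as an assumption about what you want. It relies on dicts keeping insertion order (Python 3.7+), and every line has to be valid JSON on its own.

```python
import json

# Key order taken from the schema given in the question.
SCHEMA_KEYS = ["about", "category", "id", "cat_list"]


def normalize_line(line, keys=SCHEMA_KEYS, default=""):
    """Parse one JSON object and rebuild it with the keys in schema order."""
    obj = json.loads(line)
    # Keys missing from the object get the default; extra keys are dropped.
    return {key: obj.get(key, default) for key in keys}


def jsonlines_to_array(lines, keys=SCHEMA_KEYS):
    """Convert an iterable of JSON Lines into a list of normalized dicts."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        records.append(normalize_line(line, keys))
    return records


lines = [
    '{"about":"foo", "category":"bar", "id":"123", "cat_list": ""}',
    '{"id":"22","about":"barFoo", "category":"NotABar"}',
]

result = jsonlines_to_array(lines)
# Dumping the list produces one JSON array for the JSON importer.
print(json.dumps(result, indent=2))
```

For a real file you would pass the open file object instead of the `lines` list, for example `jsonlines_to_array(open("data.jsonl", encoding="utf-8"))`, where `data.jsonl` is a placeholder name. Lines that are not valid JSON will raise `json.JSONDecodeError`, so you may want to catch that and log the line number.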
