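I have a 300 MB CSV containing 3 million rows of city information from Geonames.org. I am trying to convert this CSV to JSON so that I can import it into MongoDB with mongoimport. The reason I want JSON is that it lets me specify the "loc" field as an array rather than a string, so it can be used with a geospatial index. The CSV is UTF-8 encoded.
A snippet of my CSV looks like this: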
"geonameid","name","asciiname","alternatenames","loc","feature_class","feature_code","country_code","cc2","admin1_code","admin2_code","admin3_code","admin4_code"
3,"Zamīn Sūkhteh","Zamin Sukhteh","Zamin Sukhteh,Zamīn Sūkhteh","[48.91667,32.48333]","P","PPL","IR",,"15",,,
5,"Yekāhī","Yekahi","Yekahi,Yekāhī","[48.9,32.5]","P","PPL","IR",,"15",,,
7,"Tarvīḩ ‘Adāī","Tarvih `Adai","Tarvih `Adai,Tarvīḩ ‘Adāī","[48.2,32.1]","P","PPL","IR",,"15",,,
The desired JSON output for use with mongoimport (apart from the character set) is as follows:
{"geonameid":3,"name":"Zamin Sukhteh","asciiname":"Zamin Sukhteh","alternatenames":"Zamin Sukhteh,Zamin Sukhteh","loc":[48.91667,32.48333] ,"feature_class":"P","feature_code":"PPL","country_code":"IR","cc2":null,"admin1_code":15,"admin2_code":null,"admin3_code":null,"admin4_code":null}
{"geonameid":5,"name":"Yekahi","asciiname":"Yekahi","alternatenames":"Yekahi,Yekahi","loc":[48.9,32.5] ,"feature_class":"P","feature_code":"PPL","country_code":"IR","cc2":null,"admin1_code":15,"admin2_code":null,"admin3_code":null,"admin4_code":null}
{"geonameid":7,"name":"Tarvi? ‘Adai","asciiname":"Tarvih `Adai","alternatenames":"Tarvih `Adai,Tarvi? ‘Adai","loc":[48.2,32.1] ,"feature_class":"P","feature_code":"PPL","country_code":"IR","cc2":null,"admin1_code":15,"admin2_code":null,"admin3_code":null,"admin4_code":null}
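I have tried every online CSV-to-JSON converter I could find, but they fail because of the file size. The closest I got was Mr Data Converter (its output is shown above), whose result imports into MongoDB once the opening and closing brackets and the commas between documents are removed. Unfortunately, that tool does not work on a 300 MB file.
The JSON above is set to UTF-8 encoding but still has character-set problems, most likely due to a conversion error?
I have spent the last three days learning Python, trying Python CSVKIT, trying every CSV-to-JSON script on stackoverflow, importing the CSV into MongoDB and changing the "loc" string to an array (unfortunately this keeps the quotes), and even trying to copy and paste 30,000 records at a time by hand. Lots of reverse engineering, trial and error, and so on.
Does anyone know how to produce the JSON above while keeping the encoding correct, as it is in the CSV above? I am completely stuck.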
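For reference, the conversion described above might be sketched in Python 3 roughly as follows. This is only a minimal sketch under stated assumptions, not a tested solution for the full file: the function names `convert_row`, `csv_to_json_lines` and `convert_file` are hypothetical, the column names are taken from the CSV snippet, and empty cells are assumed to map to `null`. It streams one row at a time, so the 300 MB file is never held in memory, and it writes one JSON document per line. Passing `ensure_ascii=False` together with an explicit UTF-8 output file is what should keep characters such as `ī` and `ḩ` intact.

```python
import csv
import json


def convert_row(row):
    """Turn one csv.DictReader row into a dict ready for json.dumps."""
    doc = {}
    for key, value in row.items():
        if not value:
            doc[key] = None               # assumption: empty cell -> JSON null
        elif key == "geonameid":
            doc[key] = int(value)         # numeric id, as in the desired output
        elif key == "loc":
            doc[key] = json.loads(value)  # "[48.9,32.5]" -> [48.9, 32.5]
        else:
            doc[key] = value              # everything else stays a string
    return doc


def csv_to_json_lines(src, dst):
    """Stream rows from the text file src, write one JSON document per line to dst."""
    count = 0
    for row in csv.DictReader(src):
        # ensure_ascii=False writes non-ASCII characters as-is instead of \uXXXX
        dst.write(json.dumps(convert_row(row), ensure_ascii=False))
        dst.write("\n")
        count += 1
    return count


def convert_file(csv_path, json_path):
    """Open both files explicitly as UTF-8 and convert (paths are placeholders)."""
    with open(csv_path, encoding="utf-8", newline="") as src, \
         open(json_path, "w", encoding="utf-8") as dst:
        return csv_to_json_lines(src, dst)
```

Calling `convert_file("cities.csv", "cities.json")` (both file names are placeholders) would then produce a file with one document per line, the same shape as the desired output above. Note that this sketch leaves `admin1_code` as a string rather than the number shown in the sample, since such codes are not guaranteed to be numeric in every row; converting it with `int` would be a one-line change in `convert_row` if that assumption holds for the data.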