python - 使用 Ijson 将非常大的 json 转换为 sql

Question

我有 60GB 的 json 文件，我想用 ijson 将它转换为 sql。（我尝试了很多软件和工具，它们没有用，我的系统崩溃了。

注意：这不是重复的！我看到了关于这个方法的所有代码并写了这个，但我的代码仍然很慢。

这是我的代码：

import json
import sqlite3
import ijson 
import threading


def solve():
    with sqlite3.connect("filename.db") as conn:
        try:
            conn.execute('''
            CREATE TABLE IF NOT EXISTS mytable(
            username       VARCHAR(225) 
            ,phone         INTEGER   
            ,id            INTEGER PRIMARY KEY

            );''')
        except :
            pass
        keys = ["username", "phone" ,  "id"]
        with open('VERYBIGJSON.json' , 'r' , encoding='utf-8') as json_file:
            data = ijson.parse(json_file ,  multiple_values=True)

            for prefix, event, value in data:
                if str(event) == 'start_map':
                    _u = None

                if prefix == "id" :
                    _id = value

                if prefix == "username":
                    _u = value

                if prefix == "phone":
                    _p = value

                try:
                    if  str(event) == 'end_map' :
                        values = (_u , _p , _id)
                        cmd = """INSERT INTO mytable (username , phone , id)  VALUES(
                                    ?,
                                    ?,
                                    ?

                                );"""
                        conn.execute(cmd, values)



                        conn.commit()

                except Exception as e:
                    #print (str(e))
                    continue

if __name__ == '__main__':
    t=[]
    for i in range(1000):
        t.append(threading.Thread(target=solve))


    for i in t: 
        i.start() 

    for i in t:
        i.join()

我测试了多线程和多处理方法，但我的代码仍然运行非常缓慢！（每秒只会生成 10KB 的我的 sql 数据库）。

我想非常有效地做到这一点。

我的 json 样本是：

{"message":"{\"_\":\"user\",\"delete\":{\"test\":true},\"flags\":2067,\"id\":11111110,\"phone\":\"xxxxxxxxxx\",\"photo\":{\"_\":\"userProfilePhoto\",\"photo_id\":\"xxxxxxxxxx\",\"photo_small\":{\"_\":\"file\",\"dcs\":4,\"volume_id\":\"8788701\",\"local_id\":385526},\"photo\":{\"_\":\"file\",\"dcs\":4,\"local_id\":385528}},\"status\":{\"_\":\"userStat\",\"online\":1566173427}}","phone":"xxxxxxxxxx","@version":"1","id":11111110}
{"index": {"_type": "_doc", "_id": "-Lcy4m8BAObvGO9GAsFa"}}

……

请给我一个提高代码速度的想法。

更新： 根据评论，我编写了一个将我的大 json 文件转换为 .csv 的代码，但它仍然比方法 1 慢但更快！

这是代码：

import json
import sqlite3
import ijson 
from multiprocessing import Process
import csv
import pandas as pd
from pandas.io.json import json_normalize
from multiprocessing import Process
import threading

def solve():

        with open('VERYBIGJSON.json' , 'r' , encoding='utf-8') as json_file:
            data = ijson.parse(json_file ,  multiple_values=True )

            for prefix, event, value in data:

                if str(event) == 'start_map':

                    _u = None




                if prefix == "id" :
                    _id = value

                if prefix == "username":
                    _u = value

                if prefix == "phone":
                    _p = value



                if  str(event) == 'end_map' :

                            values = [{'username':'{}'.format(_u) , 'id':'{}'.format(_id) , 'phone':'{}'.format(_p)}] #converting to json for add to 'json_normalize'


                            df = json_normalize(values)
                            df.to_csv('test.csv', index=False, mode='a' ,  encoding='utf-8' , header=False)



if __name__ == '__main__':
    solve()

还有 JsonSlicer 库：

def solve():


        with open('VERYBIGJSON.json' , 'r' , encoding='utf-8') as json_file:
            for key   in JsonSlicer(json_file , (None , None) ) :
                print (key)


if __name__ == '__main__':
    solve()

我得到这个错误：

    for key   in JsonSlicer(json_file , (None , None) ) :
RuntimeError: YAJL error: parse error: trailing garbage
          d_from":"telegram_contacts"} {"index": {"_type": "_doc", "_i
                     (right here) ------^

我认为这个库不支持我的 json 文件。

python - 使用 Ijson 将非常大的 json 转换为 sql

0 回答 0

Related

Reference