
I am working with jsonl files that look like this in the VSCode editor:

first.jsonl

1.{"ConnectionTime": 730669.644775033,"objectId": "eHFvTUNqTR","CustomName": "Relay Controller","FirmwareRevision": "FW V1.96","DeviceID": "F1E4746E-DCEC-495B-AC75-1DFD66527561","PeripheralType": 9,"updatedAt": "2016-12-13T15:50:41.626Z","Model": "DF Bluno","HardwareRevision": "HW V1.7","Serial": "0123456789","createdAt": "2016-12-13T15:50:41.626Z","Manufacturer": "DFRobot"}
2.{"ConnectionTime": 702937.7616419792, "objectId": "uYuT3zgyez", "CustomName": "Relay Controller", "FirmwareRevision": "FW V1.96", "DeviceID": "F1E4746E-DCEC-495B-AC75-1DFD66527561", "PeripheralType": 9, "updatedAt": "2016-12-13T08:08:29.829Z", "Model": "DF Bluno", "HardwareRevision": "HW V1.7", "Serial": "0123456789", "createdAt": "2016-12-13T08:08:29.829Z", "Manufacturer": "DFRobot"}
3.
4.
5.
6.

second.jsonl

1.{"ConnectionTime": 730669.644775033,"objectId": "eHFvTUNqTR","CustomName": "Relay Controller","FirmwareRevision": "FW V1.96","DeviceID": "F1E4746E-DCEC-495B-AC75-1DFD66527561","PeripheralType": 9,"updatedAt": "2016-12-13T15:50:41.626Z","Model": "DF Bluno","HardwareRevision": "HW V1.7","Serial": "0123456789","createdAt": "2016-12-13T15:50:41.626Z","Manufacturer": "DFRobot"}
2.{"ConnectionTime": 702937.7616419792, "objectId": "uYuT3zgyez", "CustomName": "Relay Controller", "FirmwareRevision": "FW V1.96", "DeviceID": "F1E4746E-DCEC-495B-AC75-1DFD66527561", "PeripheralType": 9, "updatedAt": "2016-12-13T08:08:29.829Z", "Model": "DF Bluno", "HardwareRevision": "HW V1.7", "Serial": "0123456789", "createdAt": "2016-12-13T08:08:29.829Z", "Manufacturer": "DFRobot"}
3.
4.

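And there are more files, each with a random number of trailing line endings before the EOF marker. I would like each file to end with a single line or a single empty line. I keep getting this error:

    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)

using this approach: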

import glob
import io

filenames = glob.glob("folder_with_all_jsonl/*.jsonl")

# read file by file, write file by file. Simple.

for f in filenames:
    # path to the jsonl file/s
    data_json = io.open(f, mode='r', encoding='utf-8-sig')  # opens the JSONL file
    data_python = extract_json(data_json)  # extract_json: helper not shown in the question
    # .....code omitted
    for line in data_python:  # it would fail here because of an empty line
        print(line.get("objectId"))
        # and so on
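
The `Expecting value` message is what `json.loads` raises when it is handed an empty or whitespace-only string, which fits the blank trailing lines shown in the files. A minimal sketch of skipping such lines before parsing; the helper name `parse_nonblank_lines` and the sample data are made up for illustration and are not part of the original code:

```python
import json


def parse_nonblank_lines(lines):
    """Parse each non-blank line as JSON; blank lines are skipped."""
    records = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            # Blank or whitespace-only line: json.loads would raise
            # JSONDecodeError ("Expecting value") on it, so skip it.
            continue
        records.append(json.loads(stripped))
    return records


# Hypothetical sample mimicking a file with extra trailing newlines.
sample = [
    '{"objectId": "eHFvTUNqTR"}\n',
    '{"objectId": "uYuT3zgyez"}\n',
    '\n',
    '\n',
]
records = parse_nonblank_lines(sample)
```

Here `records` ends up with two dictionaries, and the two blank entries are ignored rather than raising.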

I manually removed some of the extra lines and was then able to process 2 of my jsonl files.

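I have looked at these SO threads:
1> Removing newline characters in a JSON file using Python.

2> Replacing multiple newlines with a single newline while reading a file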

Please give me a hint or some help. I would appreciate it!!

I want every file to be in this format: first.jsonl

1.{"ConnectionTime": 730669.644775033,"objectId": "eHFvTUNqTR","CustomName": "Relay Controller","FirmwareRevision": "FW V1.96","DeviceID": "F1E4746E-DCEC-495B-AC75-1DFD66527561","PeripheralType": 9,"updatedAt": "2016-12-13T15:50:41.626Z","Model": "DF Bluno","HardwareRevision": "HW V1.7","Serial": "0123456789","createdAt": "2016-12-13T15:50:41.626Z","Manufacturer": "DFRobot"}
2.{"ConnectionTime": 702937.7616419792, "objectId": "uYuT3zgyez", "CustomName": "Relay Controller", "FirmwareRevision": "FW V1.96", "DeviceID": "F1E4746E-DCEC-495B-AC75-1DFD66527561", "PeripheralType": 9, "updatedAt": "2016-12-13T08:08:29.829Z", "Model": "DF Bluno", "HardwareRevision": "HW V1.7", "Serial": "0123456789", "createdAt": "2016-12-13T08:08:29.829Z", "Manufacturer": "DFRobot"}
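
One way to get a file into the format above is to copy only its non-blank lines into a temporary file and then swap that file in. This is a sketch under assumptions, not a tested solution for the 4 GB inputs: the function name `strip_blank_lines` is hypothetical, the output is written as UTF-8 without a BOM and with `\n` line endings, and each kept line is checked with `json.loads` so that a malformed line aborts the rewrite and leaves the original untouched.

```python
import json
import os
import tempfile


def strip_blank_lines(path, encoding="utf-8-sig"):
    """Rewrite `path` with one JSON value per line and a single final newline."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temporary file in the same directory, so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="\n") as outfile, \
                open(path, "r", encoding=encoding) as infile:
            for line in infile:
                stripped = line.strip()
                if not stripped:
                    continue  # drop blank lines wherever they occur
                json.loads(stripped)  # validation only; raises on a malformed line
                outfile.write(stripped + "\n")
        os.replace(tmp_path, path)  # swap the cleaned copy in
    except Exception:
        os.remove(tmp_path)  # leave the original file as it was
        raise
```

Because lines are copied one at a time, memory use does not grow with the file size. It could be called once per file, for example inside a loop over `glob.glob("folder_with_all_jsonl/*.jsonl")`.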

EDIT: I used 正阳宋's answer and chepner's suggestion. I actually have two 4 GB files, and doing this:

import glob
import json

for f in glob.glob("folder_with_all_jsonl/*.jsonl"):
    results = []  # reset per file so records from earlier files are not written again
    with open(f, 'r', encoding='utf-8-sig') as infile:
        for line in infile:
            try:
                results.append(json.loads(line))  # read each line of the file
            except ValueError:
                print(f)  # blank or malformed line: report the file and skip the line
    with open(f, 'w', encoding='utf-8-sig') as outfile:
        for result in results:
            outfile.write(json.dumps(result) + "\n")

results in this error:

line 852, in start _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

I am on my personal Windows machine.

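EDIT 2: I moved to my work machine and was able to resolve this there. Any input on how we can prevent this on a personal machine? Something like parallel processing??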
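
The cause of the `RuntimeError` cannot be determined from the traceback fragment alone. Separately, the loop in the first edit keeps every parsed record in `results` before writing, which for 4 GB inputs needs a large amount of memory. A sketch of reading lazily instead, so only one record is held at a time; the generator name `iter_jsonl` and the demo file are assumptions for illustration, and whether this has any bearing on the thread error is not established:

```python
import json
import os
import tempfile


def iter_jsonl(path, encoding="utf-8-sig"):
    """Yield one parsed record at a time, skipping blank lines."""
    with open(path, "r", encoding=encoding) as infile:
        for line in infile:
            if line.strip():
                yield json.loads(line)


# Demo on a throwaway file with trailing blank lines (made-up data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as fh:
        fh.write('{"objectId": "eHFvTUNqTR"}\n{"objectId": "uYuT3zgyez"}\n\n\n')
    object_ids = [record.get("objectId") for record in iter_jsonl(demo_path)]
```

The list comprehension is only there to show the result; for the large files, each record would be handled inside a plain `for` loop instead of being collected.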


1 Answer


This is just a response to your last code snippet.

You could change the line

json.dump(result, outfile, indent=None)

to something like:

for one_item in result:
    outfile.write(json.dumps(one_item)+"\n")
answered 2020-06-07T14:10:07.867
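
To illustrate the difference between the two forms in the answer, here is a small sketch that assumes `result` is a list of records. A single `json.dump` of the list writes one JSON array on one line, while one `json.dumps` call per item gives one object per line, each ending in a newline. The records and the in-memory `io.StringIO` buffers are made up for the demonstration:

```python
import io
import json

# Made-up records standing in for the parsed contents of one file.
records = [{"objectId": "eHFvTUNqTR"}, {"objectId": "uYuT3zgyez"}]

# Form 1: dump the whole list at once. The output is a single JSON array.
as_array = io.StringIO()
json.dump(records, as_array, indent=None)

# Form 2: dump item by item. The output is JSON Lines, one object per line.
as_jsonl = io.StringIO()
for one_item in records:
    as_jsonl.write(json.dumps(one_item) + "\n")

array_text = as_array.getvalue()
jsonl_text = as_jsonl.getvalue()
```

Only the second form produces output that a line-by-line `json.loads` reader can parse record by record.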