python - Pandas df to ndjson gives an incorrect number of rows

Question

I have a DataFrame with 320 rows. I converted it to ndjson with pandas:

df.to_json('file.json', orient='records', lines=True)

However, when I load the data back, I only get 200 rows.

with open('file.json') as f:
    print(len(f.readlines()))

gives 200

spark.read.json('file.json').count()

also gives 200

Only reloading it with pandas gives the correct number of rows:

pd.read_json('file.json', orient='records', lines=True)

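My dataset contains \n characters in some fields. When loading the records with Python or Spark, I would expect to get as many rows or more, not fewer.

What is wrong with the pandas.to_json method?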

score 0 · Accepted Answer

I checked the JSON file by hand, line by line, and it looks like pandas.to_json wrote it incorrectly (or I misunderstood the spec).
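The manual check can be automated. This is a minimal sketch, assuming a well-formed ndjson line holds exactly one JSON value; the helper name find_bad_lines is hypothetical and not part of pandas. It reports the lines that do not parse on their own, which includes lines where two records ended up joined by a comma.

```python
import json

def find_bad_lines(path):
    """Return the 1-based numbers of lines that are not a single JSON value."""
    bad = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                # json.loads raises if the line holds more than one value,
                # e.g. two objects joined as '{...},{...}'
                json.loads(line)
            except json.JSONDecodeError:
                bad.append(lineno)
    return bad
```

Calling find_bad_lines('file.json') on the file from the question should list the lines that need repair; an empty list means every line is a valid record by itself.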

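# Put each record on its own line wherever two objects are joined by a comma.
with open('file.json') as f:
    j = f.read().replace('},{', '}\n{')
# Write the repaired content to a new file.
with open('file.jsonl', 'w') as f:
    f.write(j)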

Replacing the faulty separators in the file solved the problem.
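One caveat with the string replacement: it would also split a field whose text happens to contain },{. A more careful sketch, assuming the only defect is several objects sharing one line, parses each object with the standard library's json.JSONDecoder.raw_decode and writes it back on its own line. The function name rewrite_as_ndjson is hypothetical.

```python
import json

def rewrite_as_ndjson(src, dst):
    """Copy src to dst with exactly one JSON value per line; return the count."""
    decoder = json.JSONDecoder()
    written = 0
    with open(src, encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            pos = 0
            while pos < len(line):
                # Parse one complete JSON value starting at pos.
                obj, pos = decoder.raw_decode(line, pos)
                # json.dumps escapes newlines inside strings, so each
                # record stays on a single physical line.
                fout.write(json.dumps(obj, ensure_ascii=False) + "\n")
                written += 1
                # Skip the comma or whitespace that separates joined objects.
                while pos < len(line) and line[pos] in ", \t":
                    pos += 1
    return written
```

For the file in the question, rewrite_as_ndjson('file.json', 'file.jsonl') would be expected to return the DataFrame's row count if joined records were the only problem.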
