python - Split a JSON file into equal/smaller parts with Python

Question

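I am currently working on a project in which I run sentiment analysis on Twitter posts. I classify the tweets with Sentiment140. With that tool I can classify up to 1,000,000 tweets per day, and I have collected about 750,000 tweets, so that should be fine. The only problem is that I can send at most 15,000 tweets at a time to the JSON bulk classification.

My whole code is set up and running. The only problem is that my JSON file currently contains all 750,000 tweets.

Hence my question: what is the best way to split the JSON into smaller files with the same structure? I would prefer to do this in Python.

I have thought about iterating through the file. But how do I specify in the code that it should create a new file after, for example, 5,000 elements?

I would love some hints on the most reasonable approach. Thanks!

Edit: This is the code I have so far.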

import itertools
import json
from itertools import izip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

# Open JSON file
values = open('Tweets.json').read()
#print values

# Adjust formatting of JSON file
values = values.replace('\n', '')    # do your cleanup here
#print values

v = values.encode('utf-8')
#print v

# Load JSON file
v = json.loads(v)
print type(v)

for i, group in enumerate(grouper(v, 5000)):
    with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
        json.dump(list(group), outputfile)

The output is:

["data", null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, ...]

in a file named "outputbatch_0.json".

Edit 2: This is the structure of the JSON.

{
"data": [
{
"text": "So has @MissJia already discussed this Kelly Rowland Dirty Laundry song? I ain't trying to go all through her timelime...",
"id": "1"
},
{
"text": "RT @UrbanBelleMag: While everyone waits for Kelly Rowland to name her abusive ex, don't hold your breath. But she does say he's changed: ht\u00e2\u20ac\u00a6",
"id": "2"
},
{
"text": "@Iknowimbetter naw if its weak which I dont think it will be im not gonna want to buy and up buying Kanye or even Kelly Rowland album lol",
"id": "3"}
]
}

score 8 · Accepted Answer

Use an iteration grouper; the itertools module recipes list includes the following:

try:
    from itertools import izip_longest  # Python 2
except ImportError:
    from itertools import zip_longest as izip_longest  # Python 3

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

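This lets you iterate over your tweets in groups of 5000:

# input_tweets must be the list of tweets (v['data'] in the question), not the top-level dict
for i, group in enumerate(grouper(input_tweets, 5000)):
    batch = [t for t in group if t is not None]  # drop the None padding of the last group
    with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
        json.dump(batch, outputfile)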
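
The output shown in the question, ["data", null, null, ...], is consistent with the grouper being handed the top-level dict rather than the list under "data": iterating a dict yields its keys, and the single key is then padded with fill values up to the group size. As a minimal sketch (not part of the original answer), assuming Python 3 and the {"data": [...]} layout from the question, plain list slicing gives the same batching without any padding and keeps the original structure in every output file. The function names, the `key` parameter, and the file name pattern are hypothetical, chosen only for illustration.

```python
import json


def split_json_batches(document, batch_size, key="data"):
    """Return a list of dicts, each holding at most batch_size items under key."""
    items = document[key]  # the list of tweets, not the top-level dict
    return [
        {key: items[start:start + batch_size]}
        for start in range(0, len(items), batch_size)
    ]


def write_batches(batches, pattern="outputbatch_{}.json"):
    """Write each batch to its own file; the file name pattern is an assumption."""
    for i, batch in enumerate(batches):
        with open(pattern.format(i), "w") as outputfile:
            json.dump(batch, outputfile)


# Small in-memory example standing in for the 750,000-tweet file.
sample = {"data": [{"text": "tweet {}".format(n), "id": str(n)} for n in range(1, 8)]}
batches = split_json_batches(sample, 3)
print([len(b["data"]) for b in batches])  # [3, 3, 1]
```

For the real file, something like `write_batches(split_json_batches(json.load(open('Tweets.json')), 5000))` would presumably be the equivalent call; slicing needs the whole list in memory, which the question's code already does.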

score 0 · Accepted Answer

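I think your first idea is good. Just iterate over all the tweets you have, keep them in a temporary array, and track an index that you increment by one for each tweet. Whenever the current index modulo 5000 equals 0, call a method that converts the tweets to string format and saves them to a file whose name contains the index, then empty the temporary array. When you reach the end of the tweets, do the same for whatever remains.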
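
The loop described above could be sketched roughly as follows, assuming Python 3 and that the tweets are already loaded as a list. The helper names `write_in_batches` and `_flush`, the output directory argument, and the use of a running batch number (rather than the raw tweet index) in the file name are illustrative choices, not something the answer specifies.

```python
import json
import os
import tempfile


def _flush(buffer, out_dir, batch_number):
    """Write one batch to disk, keeping the {"data": [...]} layout; return its path."""
    path = os.path.join(out_dir, "outputbatch_{}.json".format(batch_number))
    with open(path, "w") as outputfile:
        json.dump({"data": buffer}, outputfile)
    return path


def write_in_batches(tweets, batch_size, out_dir):
    """Collect tweets in a temporary list and flush it every batch_size items."""
    paths = []
    buffer = []
    for index, tweet in enumerate(tweets, start=1):
        buffer.append(tweet)
        if index % batch_size == 0:
            paths.append(_flush(buffer, out_dir, len(paths)))
            buffer = []
    if buffer:  # the remainder that did not fill a whole batch
        paths.append(_flush(buffer, out_dir, len(paths)))
    return paths


# Small example: 7 tweets in batches of 3, written to a throwaway directory.
tweets = [{"text": "tweet {}".format(n), "id": str(n)} for n in range(1, 8)]
with tempfile.TemporaryDirectory() as out_dir:
    paths = write_in_batches(tweets, 3, out_dir)
    sizes = []
    for path in paths:
        with open(path) as f:
            sizes.append(len(json.load(f)["data"]))
print(sizes)  # [3, 3, 1]
```

Because the buffer never holds more than one batch, this variant also works when the tweets come from a generator instead of a fully loaded list.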
