I have some very large (> 500MB) JSON files that I need to map to a new format and upload to a new DB.
The old format:
{
id: '001',
timestamp: '2016-06-02T14:10:53Z',
contentLength: 123456,
filepath: 'original/...',
size: 'original'
},
{
id: '001',
timestamp: '2016-06-02T14:10:53Z',
contentLength: 24565,
filepath: 'medium/...',
size: 'medium'
},
{
id: '001',
timestamp: '2016-06-02T14:10:53Z',
contentLength: 5464,
filepath: 'small/...',
size: 'small'
}
The new format:
{
Id: '001',
Timestamp: '2016-06-02T14:10:53Z',
OriginalSize: {
ContentLength: 123456,
FilePath: 'original/...'
},
MediumSize: {
ContentLength: 24565,
FilePath: 'medium/...'
},
SmallSize: {
ContentLength: 5464,
FilePath: 'small/...'
}
}
With small datasets I was achieving this like so, processing the 'original' size first:
let out = data.filter(o => o.size === 'original').map(o => {
    return {
        Id: o.id,
        Timestamp: o.timestamp,
        OriginalSize: {
            ContentLength: o.contentLength,
            FilePath: o.filepath
        }
    };
});

data.filter(o => o.size !== 'original').forEach(o => {
    let orig = out.find(og => og.Timestamp === o.timestamp);
    // 'medium' -> 'MediumSize', 'small' -> 'SmallSize'
    orig[o.size[0].toUpperCase() + o.size.slice(1) + 'Size'] = {
        ContentLength: o.contentLength,
        FilePath: o.filepath
    };
});

// out now contains the correctly-formatted objects
The problem comes with the very large datasets, where I can't load hundreds of megabytes of JSON into memory all at once. This seems like a great time to use streams, but if I read the file in chunks, running .find() over just the current chunk won't locate the matching 'original' record when it arrived in an earlier chunk. And if I scan through the whole file once to collect the originals and then scan through again to attach the other sizes, I end up holding the whole transformed dataset in memory anyway.
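For concreteness, this is roughly the dead end I mean. Even if I stream the parse with JSONStream, the merge step still forces me to keep an ever-growing map of partially-built records, because a 'medium' or 'small' entry can appear arbitrarily far from its 'original' (file name and the choice of timestamp as the key are just illustrative, matching the .find() above):

const fs = require('fs');
const JSONStream = require('JSONStream');

// Parse the top-level array one element at a time...
const parser = fs.createReadStream('old-format.json').pipe(JSONStream.parse('*'));

// ...but merging means holding every partially-built record until 'end'.
const byTimestamp = new Map();

parser.on('data', o => {
    let rec = byTimestamp.get(o.timestamp);
    if (!rec) {
        rec = { Id: o.id, Timestamp: o.timestamp };
        byTimestamp.set(o.timestamp, rec);
    }
    const key = o.size === 'original'
        ? 'OriginalSize'
        : o.size[0].toUpperCase() + o.size.slice(1) + 'Size';
    rec[key] = { ContentLength: o.contentLength, FilePath: o.filepath };
});

parser.on('end', () => {
    // byTimestamp.values() is effectively the whole dataset in memory again.
});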
I know of JSONStream, which would be great if I were doing a simple 1:1 remapping of my objects.
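For reference, that simple case would look something like this (a sketch using JSONStream plus event-stream's mapSync; the file names and the flattened output shape are just illustrative). It doesn't fit my situation because each output object has to be assembled from three separate input objects:

const fs = require('fs');
const JSONStream = require('JSONStream');
const es = require('event-stream');

// A pure 1:1 remap streams cleanly: parse each array element, transform it,
// and re-serialize, never holding more than one object at a time.
fs.createReadStream('old-format.json')
    .pipe(JSONStream.parse('*'))
    .pipe(es.mapSync(o => ({
        Id: o.id,
        Timestamp: o.timestamp,
        ContentLength: o.contentLength,
        FilePath: o.filepath
    })))
    .pipe(JSONStream.stringify())
    .pipe(fs.createWriteStream('new-format.json'));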
Surely I can't be the first one to run into this kind of problem. What solutions have been used in the past? How can I approach this?