I have some very large (> 500MB) JSON files that I need to map to a new format and upload to a new DB.

The old format:

{
    id: '001',
    timestamp: 2016-06-02T14:10:53Z,
    contentLength: 123456,
    filepath: 'original/...',
    size: 'original'
},
{
    id: '001',
    timestamp: 2016-06-02T14:10:53Z,
    contentLength: 24565,
    filepath: 'medium/...',
    size: 'medium'
},
{
    id: '001',
    timestamp: 2016-06-02T14:10:53Z,
    contentLength: 5464,
    filepath: 'small/...',
    size: 'small'
}

The new format:

{
    Id: '001',
    Timestamp: 2016-06-02T14:10:53Z,
    OriginalSize: {
        ContentLength: 123456,
        FilePath: 'original/...'
    },
    MediumSize: {
       ContentLength: 24565,
       FilePath: 'medium/...'
    },
    SmallSize: {
        ContentLength: 5464,
        FilePath: 'small/...'
    }
}

I was achieving this with small datasets like this, processing the 'original' size first:

let out = data.filter(o => o.size === 'original').map(o => {
    return {
        Id: o.id,
        Timestamp: o.timestamp,
        OriginalSize: {
            ContentLength: o.contentLength,
            FilePath: o.filepath
        }
    };
});
data.filter(o => o.size !== 'original').forEach(o => {
    let orig = out.find(function (og) {
        return og.Timestamp === o.timestamp;
    });
    orig[o.size.charAt(0).toUpperCase() + o.size.slice(1) + 'Size'] = {
        ContentLength: o.contentLength,
        FilePath: o.filepath
    };
});
// out now contains the correctly-formatted objects

The problem comes with the very large datasets, where I can't load the hundreds of megabytes of JSON into memory all at once. This seems like a great time to use streams, but of course if I read the file in chunks, running .find() on a small array to find the 'original' size won't work. If I scan through the whole file to find originals and then scan through again to add the other sizes to what I've found, I end up with the whole dataset in memory anyway.

I know of JSONStream, which would be great if I were doing a simple 1-1 remapping of my objects.
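
For what it's worth, a plain 1-1 remap with JSONStream would look roughly like the sketch below (assuming the input is a top-level JSON array and using event-stream's mapSync; the file names are placeholders). It doesn't solve the grouping problem, but it shows the shape of the pipeline:

var fs = require('fs');
var JSONStream = require('JSONStream');
var es = require('event-stream');

var sizeKeys = { original: 'OriginalSize', medium: 'MediumSize', small: 'SmallSize' };

fs.createReadStream('old.json')                  // placeholder input path
  .pipe(JSONStream.parse('*'))                   // emit each element of the top-level array
  .pipe(es.mapSync(function (o) {                // 1-1 transform of one object
    var mapped = { Id: o.id, Timestamp: o.timestamp };
    mapped[sizeKeys[o.size]] = {
      ContentLength: o.contentLength,
      FilePath: o.filepath
    };
    return mapped;
  }))
  .pipe(JSONStream.stringify())                  // re-serialize the results as a JSON array
  .pipe(fs.createWriteStream('new.json'));       // placeholder output path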

Surely I can't be the first one to run into this kind of problem. What solutions have been used in the past? How can I approach this?


2 Answers


I think the trick is to update the database on the fly. If the JSON file is too big for memory, then I would expect the resulting set of objects (out in your example) to be too big for memory as well.

In the comments you state that the JSON file has one object per line. So use node.js's built-in fs.createReadStream and readline to get each line of the text file, then process the line (a string) into a JSON object, and finally update the database.

parse.js

var readline = require('readline');
var fs = require('fs');

var jsonfile = 'text.json';

var linereader = readline.createInterface({
  input: fs.createReadStream(jsonfile)
});

linereader.on('line', function (line) {
  var obj = parseJSON(line); // convert line (string) to JSON object

  // check DB for existing id/timestamp
  if ( existsInDB({id:obj.id, timestamp:obj.timestamp}) ) {
    updateInDB(obj); // already exists, so UPDATE
  }
  else { insertInDB(obj); } // does not exist, so INSERT
});


// DUMMY functions below, implement according to your needs

function parseJSON (str) {
  str = str.replace(/,\s*$/, ""); // lose trailing comma
  return eval('(' + str + ')'); // insecure! so no unknown sources
}
function existsInDB (obj) { return true; }
function updateInDB (obj) { console.log(obj); }
function insertInDB (obj) { console.log(obj); }

text.json

{ id: '001', timestamp: '2016-06-02T14:10:53Z', contentLength: 123456, filepath: 'original/...', size: 'original' },
{ id: '001', timestamp: '2016-06-02T14:10:53Z', contentLength: 24565, filepath: 'medium/...', size: 'medium' },
{ id: '001', timestamp: '2016-06-02T14:10:53Z', contentLength: 5464, filepath: 'small/...', size: 'small' }

Note: I needed to quote the timestamp values to avoid syntax errors. From your question and example script I expect you either don't have that problem or already have it solved, perhaps in another way.

Also, my parseJSON implementation may differ from how you parse your JSON. Plain old JSON.parse failed for me because the properties are not quoted.
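
If the target DB happened to be MongoDB, the existsInDB / updateInDB / insertInDB trio could collapse into a single upsert per line. A rough sketch against the 2.x node driver (the connection string and collection name are made up):

var MongoClient = require('mongodb').MongoClient;

MongoClient.connect('mongodb://localhost:27017/mydb', function (err, db) {
  if (err) throw err;
  var images = db.collection('images'); // placeholder collection

  // Merge one parsed line into its target document: creates the document
  // if it does not exist yet, otherwise just adds the size-specific field.
  function upsertLine(obj, callback) {
    var sizeKey = { original: 'OriginalSize', medium: 'MediumSize', small: 'SmallSize' }[obj.size];
    var fields = { Timestamp: obj.timestamp };
    fields[sizeKey] = { ContentLength: obj.contentLength, FilePath: obj.filepath };
    images.updateOne({ Id: obj.id }, { $set: fields }, { upsert: true }, callback);
  }

  // call upsertLine(obj, ...) from the 'line' handler above instead of
  // existsInDB / updateInDB / insertInDB
});

With an upsert the order in which the sizes arrive no longer matters, which is the whole point of updating the database on the fly.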

answered 2016-06-02T16:34:22.930

Set up a database instance that can store JSON documents, e.g. MongoDB or PostgreSQL (which recently introduced the jsonb data type for storing JSON documents). Iterate over the old JSON documents and combine them into the new structure, using the DB as storage so that you overcome the memory problem.
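
For instance, with PostgreSQL 9.5+ you could keep one row per id and merge each incoming record into a jsonb column as you stream through the file. A hedged sketch using the node pg client (the database, table and column names are invented, and the id column is assumed to be the primary key):

var pg = require('pg');
var pool = new pg.Pool({ database: 'mydb' }); // placeholder connection settings

// Insert the partial document for this id, or merge its keys into the
// existing jsonb document (the || operator concatenates jsonb objects).
function mergeIntoDB(obj, callback) {
  var sizeKey = { original: 'OriginalSize', medium: 'MediumSize', small: 'SmallSize' }[obj.size];
  var doc = { Id: obj.id, Timestamp: obj.timestamp };
  doc[sizeKey] = { ContentLength: obj.contentLength, FilePath: obj.filepath };

  pool.query(
    'INSERT INTO images (id, doc) VALUES ($1, $2) ' +
    'ON CONFLICT (id) DO UPDATE SET doc = images.doc || EXCLUDED.doc',
    [obj.id, JSON.stringify(doc)],
    callback
  );
}

Once the whole file has been streamed through, each row's doc column holds the combined object in the new format.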

I'm pretty sure there is no way to achieve your goal without either a) slowing the process down (drastically) or b) creating a poor man's DB from scratch (which seems like a bad thing :)

answered 2016-06-02T15:19:34.570