javascript - Preventing a JavaScript function from running out of memory because of too many objects

Question

I'm building a Node.js web scraper that uses request and cheerio to parse the DOM. Although I'm using Node, I believe this is more of a general JavaScript question.

tl;dr - Creating ~60,000 - 100,000 objects uses up all of my computer's RAM, and I get an out of memory error in Node.

Here's how the scraper works. It's loops within loops. I've never designed anything this complex before, so there may be a better way to do this.

Loop 1: Creates 10 objects in an array called "sitesArr". Each object represents one website to scrape.

var sitesArr = [
    {
        name: 'store name',
        baseURL: 'www.basedomain.com',
        categoryFunct: '(function(){ // do stuff })();',
        gender: 'mens', 
        currency: 'USD',
        title_selector: 'h1',
        description_selector: 'p.description'
    },
    // ... x10
]

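Loop 2: Loops through "sitesArr". For each site, it goes to the homepage via "request" and gets a list of category links, usually 30-70 URLs. It appends these URLs to the "sitesArr" object they belong to, in an array property named "categories".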

var sitesArr = [
    {
        name: 'store name',
        baseURL: 'www.basedomain.com',
        categoryFunct: '(function(){ // do stuff })();',
        gender: 'mens', 
        currency: 'USD',
        title_selector: 'h1',
        description_selector: 'p.description',
        categories: [
            {
                name: 'shoes',
                url: 'www.basedomain.com/shoes'
            }, {
                name: 'socks',
                url: 'www.basedomain.com/socks'
            } // x 50
        ]
    },
    // ... x10
]

Loop 3: Loops through each "category". For each URL, it gets a list of product links and puts them into an array. Usually ~300-1000 products per category.

var sitesArr = [
    {
        name: 'store name',
        baseURL: 'www.basedomain.com',
        categoryFunct: '(function(){ // do stuff })();',
        gender: 'mens', 
        currency: 'USD',
        title_selector: 'h1',
        description_selector: 'p.description',
        categories: [
            {
                name: 'shoes',
                url: 'www.basedomain.com/shoes',
                products: [
                    'www.basedomain.com/shoes/product1.html',
                    'www.basedomain.com/shoes/product2.html',
                    'www.basedomain.com/shoes/product3.html',
                    // x 300
                ]
            }, // x 50
        ]
    },
    // ... x10
]
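
To make the size of this structure concrete, here is a minimal sketch that walks the nested array built by loops 1-3 and counts the product URLs it holds. The helper name `countProductUrls` and the sample data are made up for illustration; the real "sitesArr" is filled by the requests described above.

```javascript
// Hypothetical helper: counts every product URL stored under the sites array.
function countProductUrls(sites) {
    var total = 0;
    sites.forEach(function (site) {
        // Sites or categories that have not been scraped yet count as empty.
        (site.categories || []).forEach(function (category) {
            total += (category.products || []).length;
        });
    });
    return total;
}

// Made-up sample data: 2 sites x 3 categories x 4 product URLs each.
var sampleSites = [];
for (var s = 0; s < 2; s++) {
    var sampleSite = { name: 'store ' + s, baseURL: 'www.basedomain.com', categories: [] };
    for (var c = 0; c < 3; c++) {
        var sampleCategory = {
            name: 'category ' + c,
            url: sampleSite.baseURL + '/category' + c,
            products: []
        };
        for (var p = 0; p < 4; p++) {
            sampleCategory.products.push(sampleCategory.url + '/product' + p + '.html');
        }
        sampleSite.categories.push(sampleCategory);
    }
    sampleSites.push(sampleSite);
}

console.log(countProductUrls(sampleSites)); // 24
```

Everything counted here stays reachable for as long as "sitesArr" itself is referenced.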

Loop 4: Loops through each "products" array, goes to each URL, and creates an object for each one.

var product = {
    infoLink: "www.basedomain.com/shoes/product1.html",
    description: "This is a description for the object",
    title: "Product 1",
    Category: "Shoes",
    imgs: ['http://foo.com/img.jpg','http://foo.com/img2.jpg','http://foo.com/img3.jpg'],
    price: 60,
    currency: 'USD'
}

Then, for each product object, I send it to a MongoDB function that upserts it into my database.
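
A minimal sketch of loop 4 and the upsert step, under stated assumptions: `buildProduct`, `upsertProduct`, and the `fakeDb` Map are hypothetical stand-ins, not the code from this scraper. The real version would fetch each page with request, read the fields with cheerio, and write to MongoDB; here the `scraped` object stands for whatever the selectors returned.

```javascript
// In-memory stand-in for the MongoDB collection (an assumption, not a driver call).
var fakeDb = new Map();

// Hypothetical upsert: insert the product, or overwrite it if the URL already exists.
function upsertProduct(product) {
    fakeDb.set(product.infoLink, product);
}

// Hypothetical builder: combines site and category data with the scraped fields.
function buildProduct(site, category, url, scraped) {
    return {
        infoLink: url,
        description: scraped.description,
        title: scraped.title,
        Category: category.name,
        imgs: scraped.imgs,
        price: scraped.price,
        currency: site.currency
    };
}

var exampleSite = { name: 'store name', currency: 'USD' };
var exampleCategory = {
    name: 'shoes',
    products: [
        'www.basedomain.com/shoes/product1.html',
        'www.basedomain.com/shoes/product2.html'
    ]
};

exampleCategory.products.forEach(function (url, i) {
    // Placeholder for the request + cheerio step.
    var scraped = {
        title: 'Product ' + (i + 1),
        description: 'This is a description for the object',
        imgs: [],
        price: 60
    };
    upsertProduct(buildProduct(exampleSite, exampleCategory, url, scraped));
});

console.log(fakeDb.size); // 2
```

In this sketch each product object is handed to the upsert as soon as it is built, and only the stand-in database keeps a reference to it afterwards.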

The problem

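This all works fine until the process gets large. Every time the script runs I create about 60,000 product objects, and after a while all of my computer's RAM is used up. What's more, about halfway through the process I get the following error from Node: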

 FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory

I strongly suspect this is a code design issue. Should I be "deleting" the objects once I'm done with them? What's the best way to tackle this?
