
I'm trying to write a scraper using 'request' and 'cheerio'. I have an array of 100 URLs. I loop over the array, call 'request' on each URL, and then call cheerio.load(body). If I let the loop run for more than about 3 iterations (I limited it to i < 3 for testing), the scraper breaks: the value I build productNumber from is undefined, and I can't call split on undefined. I think the for loop is moving on before the webpage responds and has time to load the body with cheerio, and this question seems to agree: nodeJS - Using a callback function with Cheerio.

My problem is that I don't understand how I can make sure the webpage has 'loaded' or been parsed in each iteration of the loop so that I don't get any undefined variables. According to the other answer I don't need a callback, but then how do I do it?

for (var i = 0; i < productLinks.length; i++) {
    productUrl = productLinks[i];
    request(productUrl, function(err, resp, body) {
        if (err)
            throw err;
        $ = cheerio.load(body);
        var imageUrl = $("#bigImage").attr('src'),
            productNumber = $("#product").attr('class').split(/\s+/)[3].split("_")[1]
        console.log(productNumber);

    });
};

Example of output:

1461536
1499543

TypeError: Cannot call method 'split' of undefined

2 Answers


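Since you're not creating a new $ variable for each iteration, it gets overwritten whenever a request completes. This can lead to undefined behaviour, where one iteration of the loop is using $ just as it's being overwritten by another iteration.

So try creating a new variable: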

var $ = cheerio.load(body);
^^^ this is the important part

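Also, you're correct in assuming that the loop continues before the request has completed (in your case it isn't cheerio.load that is asynchronous, it's request). That's how asynchronous I/O works.

To coordinate asynchronous operations you can use, for example, the async module; in this case, async.eachSeries might be useful.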
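To make the "one at a time" idea concrete, here is a minimal sketch. It does not use the async module itself: eachInSeries is a hypothetical hand-rolled helper with the same one-item-at-a-time behaviour, and fakeRequest with its canned pages is an invented stand-in for request, so the snippet runs without installing anything or touching the network.

```javascript
// Hypothetical helper: calls iterator(item, next) for one item at a time and
// only moves on once the iterator has called next(). Stops at the first error.
function eachInSeries(items, iterator, done) {
    var index = 0;
    function step(err) {
        if (err) return done(err);
        if (index >= items.length) return done(null);
        var item = items[index++];
        iterator(item, step);
    }
    step(null);
}

// Invented pages and URLs, used only so the sketch is self-contained.
var fakePages = {
    'http://example.com/p/1': '<div id="product" class="a b c item_1461536"></div>',
    'http://example.com/p/2': '<div id="product" class="a b c item_1499543"></div>'
};

// Hypothetical stand-in for request(url, callback): the reply is deferred
// with setImmediate so the callback fires later, as it would with real I/O.
function fakeRequest(url, callback) {
    setImmediate(function () {
        callback(null, { statusCode: 200 }, fakePages[url]);
    });
}

var visited = [];
eachInSeries(Object.keys(fakePages), function (url, next) {
    fakeRequest(url, function (err, resp, body) {
        if (err) return next(err);
        visited.push(url);   // parse body here before asking for the next page
        next();
    });
}, function (err) {
    if (err) return console.error('Stopped early:', err.message);
    console.log('Finished in order:', visited.join(', '));
});
```

With the async module installed, the equivalent call would be async.eachSeries(productLinks, iterator, done), where iterator receives each URL plus a callback to invoke when that URL has been handled. Processing in series is slower than firing all 100 requests at once, but it keeps the order predictable and is gentler on the site being scraped.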

Answered 2013-09-29T10:34:36.617

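You are scraping an external site. You can't be sure that all of its HTML follows exactly the same structure, so you need to be defensive about how you traverse it.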

var product = $('#product');
// A cheerio selection is always a truthy object, so test its length instead
if (!product.length) return console.log('Cannot find a product element');
var productClass = product.attr('class');
if (!productClass) return console.log('Product element does not have a class defined');
var productNumber = productClass.split(/\s+/)[3].split("_")[1];
console.log(productNumber);

This will help you debug where things are going wrong, and it may show that you can't scrape the dataset as easily as you had hoped.
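The same defensive idea can be pushed one step further, since the fourth class name or the part after the underscore may also be missing. The sketch below is only an illustration: parseProductNumber is a hypothetical helper, and the class strings are invented, because the question does not show the real HTML.

```javascript
// Hypothetical helper: pulls the number out of a class attribute such as
// "a b c item_1461536". Returns null instead of throwing when any step fails.
function parseProductNumber(classAttr) {
    // attr('class') gives undefined when the element or attribute is missing
    if (typeof classAttr !== 'string') return null;

    var classes = classAttr.trim().split(/\s+/);
    if (classes.length < 4) return null;          // no fourth class name

    var parts = classes[3].split('_');
    return parts.length > 1 ? parts[1] : null;    // nothing after the underscore
}

// Invented sample values, standing in for what different pages might return.
var samples = ['a b c item_1461536', 'a b', undefined];
samples.forEach(function (classAttr) {
    var productNumber = parseProductNumber(classAttr);
    if (productNumber === null) {
        console.log('Could not extract a product number from:', classAttr);
    } else {
        console.log(productNumber);
    }
});
```

Inside the request callback this would be called as parseProductNumber($('#product').attr('class')), so a page with unexpected markup is logged and skipped rather than crashing the whole run.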

Answered 2013-09-29T09:33:32.637