
I'm trying to scrape a page that uses infinite scroll, using PhantomJS, CasperJS and SpookyJS. The script is supposed to keep clicking the "more" button and collecting the new links from the results until it's stopped manually. However, it uses more and more memory until it crashes. I wrote the following script; is there a way to optimise it so it doesn't use as much memory?

function pressMore(previousLinksLength) {
  this.click('#projects > div.container-flex.px2 > div > a');
  this.wait(1000, function() {
    // collect all project links currently in the DOM
    var links = this.evaluate(function() {
      var projectPreview = document.querySelectorAll('.project-thumbnail a');
      return Array.prototype.map.call(projectPreview, function(e) {
        return e.getAttribute('href');
      });
    });
    // emit only the links added since the previous iteration
    this.emit('sendScrapedLinks', links.slice(previousLinksLength));
    // repeat the scrape function
    pressMore.call(this, links.length);
  });
}
// spookyjs starts here
spooky.start(scrapingUrl);

//press the more button
spooky.then(pressMore);

spooky.run();

1 Answer


I've run into this problem on infinite-scroll sites as well. I was never able to get around the memory leak.

In short, what I ended up doing was using scroll-to. Basically, I would run the application for a while, record the last scrolled-to position, then restart the application with the recorded value to keep memory from growing. It's a pain, because the site only loads more and more content as you scroll through positions in order, so finding the positions at which to split up where you last scrolled to can be challenging.

Answered 2014-09-09T12:15:33.753