javascript - Is pjscrape slow, or is it PhantomJS? Alternative scrapers?

Question

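I just wrote my first script for pjscrape, but I find that it runs extremely slowly. I am new to both pjscrape and PhantomJS, so I don't know which one is the culprit.

I am loading the file from localhost, so the bottleneck is certainly not in the transfer.

My config.js script looks like this: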

pjs.addSuite({
    url: 'http://localhost/file.html',
    scraper: function() {
        var people = $('table.person');
        var results = [];

        // Build one record per person table
        $.each(people, function() {
            var $this = $(this);
            results.push({
                firstName: $this.find('.firstName').text(),
                lastName: $this.find('.lastName').text(),
                age: $this.find('.age').text()
            });
        });

        return results;
    }
});

I then run it with PhantomJS from the command line:

~> phantomjs pjscrape.js config.js

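I ran the same code in Chrome (just the scraper function()) and it is instant. In phantomjs/pjscrape, it takes 30 seconds.

Any clues as to what is causing the slowness?

Is there a better way to do this kind of DOM screen scraping? Perhaps a Node.js solution?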

score 2 · Accepted Answer

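If Node.js is an option, may I introduce you to Cheerio? It is a great library for working with poorly formed HTML documents. It gives you a jQuery-like API for working with a DOM-like representation of the page you are scraping. Paired with request, it provides a very simple environment for scraping HTML.

Your example would end up looking something like this (error handling is left as an exercise for the reader):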

var cheerio = require("cheerio"),
    request = require("request");

request("http://localhost/file.html", function(err, res, data) {
  // Parse the response body into a jQuery-like document
  var $ = cheerio.load(data);

  var people = $('table.person');
  var results = [];

  // Build one record per person table
  people.each(function() {
    var $this = $(this);

    results.push({
      firstName: $this.find('.firstName').text(),
      lastName: $this.find('.lastName').text(),
      age: $this.find('.age').text()
    });
  });

  // Placeholder for whatever you do with the scraped data
  do_something_with(results);
});

score 1 · Accepted Answer

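If the page you are working with sends fully formed HTML and does not need client-side JavaScript to manipulate the DOM into its final form, skip PhantomJS and use an HTTP client library (Node core, or request, superagent, or hyperquest) together with cheerio to extract the data you need from the DOM.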
