web-crawler - How do you spider a site with PhantomJS

Question

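I'm trying to use PhantomJS to spider an entire domain. I want to start at the root domain, e.g. www.domain.com, pull all the links (a.href), and then fetch each new link, adding any newly found links to the queue if they haven't already been crawled or queued.

Ideas? Help?

Thanks in advance!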

score 20 · Accepted Answer

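You might be interested in checking out Pjscrape (disclaimer: this is my project), an open-source scraping library built on top of PhantomJS. It has built-in support for spidering pages and scraping information from each page as it goes. You can spider an entire site, looking at every anchor link, with a short script like this: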

pjs.addSuite({
    url: 'http://www.example.com/your_start_page.html',
    moreUrls: function() {
        // get all URLs from anchor links,
        // restricted to the current domain by default
        return _pjs.getAnchorUrls('a');
    },
    scraper: function() {
        // scrapers can use jQuery
        return $('h1').first().text();
    }
});

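By default, this skips pages that have already been spidered and only follows links on the current domain, though both behaviors can be changed in your settings.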

score 6 · Accepted Answer

This is an old question, but as an update, a great modern answer is http://www.nightmarejs.org/ (GitHub: https://github.com/segmentio/nightmare).

Quoting a compelling example from their homepage:

Raw PhantomJS:

phantom.create(function (ph) {
  ph.createPage(function (page) {
    page.open('http://yahoo.com', function (status) {
      page.evaluate(function () {
        var el =
          document.querySelector('input[title="Search"]');
        el.value = 'github nightmare';
      }, function (result) {
        page.evaluate(function () {
          var el = document.querySelector('.searchsubmit');
          var event = document.createEvent('MouseEvent');
          event.initEvent('click', true, false);
          el.dispatchEvent(event);
        }, function (result) {
          ph.exit();
        });
      });
    });
  });
});

With Nightmare:

new Nightmare()
  .goto('http://yahoo.com')
  .type('input[title="Search"]', 'github nightmare')
  .click('.searchsubmit')
  .run();

score 3 · Accepted Answer

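First, select all the anchors on the index page and make a list of their href values. You can do this with either PhantomJS's document selectors or jQuery selectors. Then do the same thing for each linked page, until the pages no longer contain any new links. You should keep a master list of all links, plus a list of links for each page, so that you can tell whether a link has already been processed.

You can think of web crawling as a tree. The root node of the tree is the index page, and its child nodes are the pages linked from the index page. Each child node can have one or more children of its own, depending on the links that child page contains. I hope this helps.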
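To make the bookkeeping concrete, here is a minimal sketch of the queue and the "already seen" check, written in plain ES5 so it does not depend on any particular PhantomJS version. The names crawl, getLinks, and site are hypothetical and only for illustration. getLinks stands in for whatever actually loads a page and returns its anchor hrefs; here it reads from an in-memory object instead of the network. In a real PhantomJS script the page load is asynchronous, so the loop would need to be adapted to callbacks.

```javascript
// Breadth-first crawl with a queue and a "seen" set.
// getLinks(url) is a hypothetical callback that returns an array of
// absolute URLs found on the page at `url`.
function crawl(startUrl, getLinks) {
    var seen = Object.create(null);        // URLs already crawled or queued
    var linksByPage = Object.create(null); // per-page list of links
    var master = [];                       // master list, in crawl order
    var queue = [startUrl];
    seen[startUrl] = true;

    while (queue.length > 0) {
        var url = queue.shift();
        master.push(url);

        var links = getLinks(url) || [];
        linksByPage[url] = links;

        for (var i = 0; i < links.length; i++) {
            // Only queue links that are neither crawled nor already queued.
            if (!seen[links[i]]) {
                seen[links[i]] = true;
                queue.push(links[i]);
            }
        }
    }
    return { master: master, linksByPage: linksByPage };
}

// In-memory stand-in for a small site (assumption, not real data).
var site = {
    'http://www.example.com/':  ['http://www.example.com/a', 'http://www.example.com/b'],
    'http://www.example.com/a': ['http://www.example.com/', 'http://www.example.com/c'],
    'http://www.example.com/b': ['http://www.example.com/a'],
    'http://www.example.com/c': []
};

var result = crawl('http://www.example.com/', function (url) {
    return site[url];
});
console.log(result.master.join('\n'));
```

Marking a URL as seen at the moment it is queued, rather than when it is fetched, is what keeps the same link from being queued twice when several pages point to it.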
