web-crawler - Is there a way to specify a maximum crawl depth when using the Apify SDK?

Question

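I'm working on a project and am evaluating Scrapy and Apify. Most of the code is built around node.js, so a JavaScript solution would be nice. I also like that I can use Puppeteer in Apify. That said, my use case requires fairly shallow crawls (e.g., about 4 levels deep) of many websites. This is easy to configure in Scrapy, but I can't figure out how to do it in Apify. Is there a way to specify a maximum depth in the new Apify API? It looks like this was a parameter in their legacy crawler, but I haven't found it in the new API.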

score 1 · Accepted Answer

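There are two approaches you can take. First, you can use the Puppeteer Scraper public actor, which lets you use most of the Apify SDK's features in a simplified form and exposes the max crawling depth as a simple input under the Performance and limits section. To learn the basics, see the introduction tutorial.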

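The second approach is more involved and uses the Apify SDK directly. You can use the request.userData property to pass arbitrary user data along with each of your requests. That way, before adding more pages to the crawling queue, you can check that the desired depth has not been reached yet: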

const MAX_DEPTH = 4;

// When creating the request queue, we seed the first request with a depth of 0.
const requestQueue = await Apify.openRequestQueue();
await requestQueue.addRequest({
  url: "https://stackoverflow.com",
  userData: {
    depth: 0,
  },
});

// ...

// Then, somewhere in handlePageFunction, when adding more requests to the queue.
// `request` is the request currently being handled; its children go one level deeper.
if (request.userData.depth < MAX_DEPTH) {
  await requestQueue.addRequest({
    url: "https://example.com",
    userData: {
      depth: request.userData.depth + 1,
    },
  });
}
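
To see which pages a given MAX_DEPTH actually lets through, the same depth check can be tried out without a browser or the Apify SDK. The following is only a minimal sketch under stated assumptions: the `crawl` function, the plain array standing in for the request queue, and the `LINKS` toy link graph are all hypothetical and not part of the Apify SDK. Only the `userData.depth` convention is taken from the snippet above.

```javascript
// Hypothetical toy link graph (an assumption for illustration):
// each URL maps to the URLs found on that page.
const LINKS = {
  'https://example.com/': ['https://example.com/a', 'https://example.com/b'],
  'https://example.com/a': ['https://example.com/a/1'],
  'https://example.com/a/1': ['https://example.com/a/1/x'],
  'https://example.com/b': [],
};

// Hypothetical helper: a breadth-first crawl over `links`, using a plain
// array in place of a real request queue. Not an Apify SDK function.
function crawl(startUrl, links, maxDepth) {
  // Seed the first request with a depth of 0, as in the snippet above.
  const queue = [{ url: startUrl, userData: { depth: 0 } }];
  const seen = new Set([startUrl]); // avoid enqueueing the same URL twice
  const visited = [];

  while (queue.length > 0) {
    const request = queue.shift();
    visited.push({ url: request.url, depth: request.userData.depth });

    // Same check as in handlePageFunction: only enqueue children
    // while the current page is still above the depth limit.
    if (request.userData.depth < maxDepth) {
      for (const url of links[request.url] || []) {
        if (seen.has(url)) continue;
        seen.add(url);
        queue.push({ url, userData: { depth: request.userData.depth + 1 } });
      }
    }
  }
  return visited;
}

const visited = crawl('https://example.com/', LINKS, 2);
console.log(visited);
```

With a limit of 2, the page at depth 2 is still visited, but its links are never enqueued, so `https://example.com/a/1/x` does not appear in the result. If the limit should exclude pages at that depth entirely, the comparison would need to be adjusted accordingly.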

score 0 · Accepted Answer

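You can find a "Max crawling depth" option in apify/web-scraper. This tool is the replacement for the legacy PhantomJS-based crawler. It uses Puppeteer and has a very similar interface.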

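You can also use the Apify SDK and implement the maximum depth yourself with PuppeteerCrawler. I recommend using request.userData to keep track of your crawl depth. If you are interested in this solution, you can look at the Web Scraper source code to see how it is done there.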
