javascript - How to make an Apify crawler scroll the full page when the web page uses infinite scroll?

Question

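I am facing a problem: I cannot get all of the product data from a website because its product catalog page uses lazy loading. That means the page has to be scrolled until the whole page has loaded.

I only get the product data from the first page.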

score 2 · Accepted Answer

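First, keep in mind that infinite scroll can be implemented in many ways. Sometimes you have to click buttons along the way or go through some kind of transition. I will cover only the simplest use case here: scroll down at a fixed interval and finish when no new products are loaded.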

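If you are building your own actor with the Apify SDK, you can use the infiniteScroll helper utility function. If it does not cover your use case, please give us feedback on GitHub.
If you are using one of the generic scrapers (Web Scraper or Puppeteer Scraper), infinite scroll is not currently built in (but it may be if you are reading this in the future). On the other hand, it is not that complicated to implement yourself. Here is a simple solution for the Web Scraper pageFunction.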

async function pageFunction(context) {
    // A few utilities taken from the context
    const { request, log, jQuery } = context;
    const $ = jQuery;

    // Define the infinite scroll function here; it has to be defined inside pageFunction
    const infiniteScroll = async (maxTime) => {
        const startedAt = Date.now();
        let itemCount = $('.my-class').length; // Update the selector
        while (true) {
            log.info(`INFINITE SCROLL --- ${itemCount} items loaded --- ${request.url}`);
            // Time limit so the loop cannot run forever
            if (Date.now() - startedAt > maxTime) {
                return;
            }
            scrollBy(0, 9999);
            await context.waitFor(5000); // This can be any number that works for your website
            const currentItemCount = $('.my-class').length; // Update the selector

            // If the number of items did not change after the scroll, we are done
            if (itemCount === currentItemCount) {
                return;
            }
            itemCount = currentItemCount;
        }
    };

    // Generally, you want to do the scrolling only on the category type page
    if (request.userData.label === 'CATEGORY') {
        await infiniteScroll(60000); // Let's try 60 seconds max

        // ... Add your logic for categories
    } else {
        // Any logic for other types of pages
    }
}

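Of course, this is a very trivial example. Sometimes it gets much more complicated. I once even used Puppeteer to move the mouse directly and drag a scroll bar that was accessible programmatically.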
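
To check the stop conditions without opening a browser, the same loop can be written so that the scroll, count and wait steps are passed in as functions. This is only a sketch under assumptions: the name scrollUntilStable, its option names and the returned reason strings are made up for illustration and are not part of Apify or Web Scraper.

```javascript
// Hypothetical helper (not an Apify API): the same scroll loop as in the
// pageFunction above, with the browser-specific steps injected as functions.
async function scrollUntilStable({ scroll, countItems, wait, maxTimeMs, now = Date.now }) {
    const startedAt = now();
    let itemCount = await countItems();

    // Keep scrolling until the time limit is exceeded
    while (now() - startedAt <= maxTimeMs) {
        await scroll();
        await wait();
        const currentItemCount = await countItems();

        // No new items appeared after the scroll, so the page is fully loaded
        if (currentItemCount === itemCount) {
            return { itemCount, reason: 'no-new-items' };
        }
        itemCount = currentItemCount;
    }
    return { itemCount, reason: 'timeout' };
}
```

Inside the Web Scraper pageFunction, the three functions could be wired to the calls already used above: scroll as () => scrollBy(0, 9999), countItems as () => $('.my-class').length, and wait as () => context.waitFor(5000). Returning the reason makes it easy to log whether the page finished loading or the time limit was hit.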
