I have a list of URLs to scrape from a website that uses React, so I am using Puppeteer. To avoid being blocked by anti-bot measures, I added puppeteer-extra-plugin-stealth. To keep ads from loading on the pages, I block them with puppeteer-extra-plugin-adblocker. To keep my IP address from being blacklisted, I route traffic through TOR nodes so that requests come from different IP addresses. Below is a simplified version of my code. (TOR_port and webUrl are normally assigned dynamically; I have hard-coded them here to simplify the question.) The setup works, but there is a problem:

const puppeteer = require('puppeteer-extra');
const _StealthPlugin = require('puppeteer-extra-plugin-stealth');
const _AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');

puppeteer.use(_StealthPlugin());
puppeteer.use(_AdblockerPlugin());

var TOR_port = 13931;
var webUrl ='https://www.zillow.com/homedetails/2861-Bass-Haven-Ln-Saint-Augustine-FL-32092/47739703_zpid/';


(async () => {
    const browser = await puppeteer.launch({
        dumpio: false,
        headless: false,
        args: [
            `--proxy-server=socks5://127.0.0.1:${TOR_port}`,
            `--no-sandbox`,
        ],
        ignoreHTTPSErrors: true,
    });

    try {
        const page = await browser.newPage();
        await page.setViewport({ width: 1280, height: 720 });
        await page.goto(webUrl, {
            waitUntil: 'load',
            timeout: 30000,
        });

        try {
            // awaiting here keeps a timeout inside the surrounding try/catch
            await page.waitForSelector('.price');
        } catch (e) {
            // the selector never appeared, so this is clearly not a zillow page
            throw new Error('This is not the zillow website');
        }
        console.log('The price is available');
    } catch (e) {
        console.error(e.message);
    } finally {
        // close the browser exactly once, on success or failure
        await browser.close();
    }
})();

The above setup works but is very unreliable. I recently learned about Puppeteer-Cluster, and I need it to help me manage crawling multiple pages and to track my scraping tasks.

So my question is: how do I use Puppeteer-Cluster with the above setup? I am aware of an example (https://github.com/thomasdondorf/puppeteer-cluster/blob/master/examples/different-puppeteer-library.js) that the library offers to show how plugins can be used, but it is so bare that I did not quite understand it.

How do I implement Puppeteer-Cluster with the above TOR, AdBlocker, and Stealth configurations?

2 Answers

You can simply hand your puppeteer-extra instance to Cluster.launch() through its puppeteer option, like this:

const { Cluster } = require('puppeteer-cluster');
const puppeteer = require('puppeteer-extra');
const _StealthPlugin = require('puppeteer-extra-plugin-stealth');
const _AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');

puppeteer.use(_StealthPlugin());
puppeteer.use(_AdblockerPlugin());

// inside an async function
const cluster = await Cluster.launch({
    puppeteer,
});

Src: https://github.com/thomasdondorf/puppeteer-cluster#clusterlaunchoptions
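
To carry the question's TOR proxy over to the cluster, the launch arguments can go under puppeteerOptions, which puppeteer-cluster forwards to puppeteer's launch call. The sketch below is one possible way to build that options object. buildClusterOptions and its parameter names are hypothetical and not part of either library; the port and launch flags are the example values from the question.

```javascript
// Hypothetical helper (not part of puppeteer-cluster): builds the options
// object for Cluster.launch() from the settings used in the question.
function buildClusterOptions({ puppeteer, concurrency, torPort, maxConcurrency = 2 }) {
    return {
        puppeteer,          // the puppeteer-extra instance with plugins registered
        concurrency,        // e.g. Cluster.CONCURRENCY_CONTEXT
        maxConcurrency,
        // forwarded to puppeteer.launch() by puppeteer-cluster
        puppeteerOptions: {
            headless: false,
            ignoreHTTPSErrors: true,
            args: [
                `--proxy-server=socks5://127.0.0.1:${torPort}`,
                '--no-sandbox',
            ],
        },
    };
}

// Assumed usage, inside an async function:
//   const { Cluster } = require('puppeteer-cluster');
//   const cluster = await Cluster.launch(buildClusterOptions({
//       puppeteer,
//       concurrency: Cluster.CONCURRENCY_CONTEXT,
//       torPort: 13931,
//   }));
```

Because the proxy is a browser launch flag, every page opened by a cluster launched this way would go through the same TOR port; using a different port per URL would need a different arrangement.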

Answered 2021-01-13T15:43:30.737

To combine Puppeteer with plugins and puppeteer-cluster, first require puppeteer-extra, your plugins, and puppeteer-cluster.

puppeteer-extra exports a function, addExtra(), which takes a puppeteer instance (existing or new) and returns a version of it that supports plugins.

The .use() function of that instance registers a plugin; call it once for each plugin.

Pass the puppeteer instance as the puppeteer property of the options object given to puppeteer-cluster's Cluster.launch() function.

To set options for the browser launch, create an object with your launch options and pass it as the puppeteerOptions property of that same options object. See the second link below for more about puppeteerOptions.

For more on using puppeteer-extra with puppeteer-cluster, see the "More Examples > Using with puppeteer-cluster" section of the puppeteer-extra documentation at the first link below.

Source: https://www.npmjs.com/package/puppeteer-extra
Source: https://www.npmjs.com/package/puppeteer-cluster#clusterlaunchoptions

const { addExtra } = require("puppeteer-extra");
const vanillaPuppeteer = require("puppeteer");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
const RecaptchaPlugin = require("puppeteer-extra-plugin-recaptcha");
const { Cluster } = require("puppeteer-cluster");

(async () => {
  const puppeteer = addExtra(vanillaPuppeteer);
  puppeteer.use(StealthPlugin());
  puppeteer.use(RecaptchaPlugin());

  const cluster = await Cluster.launch({
    puppeteer,
    // puppeteerOptions go here, e.g. userDataDir or args (proxy server, etc.):
    // puppeteerOptions: { ... },
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
  });

  let i = 0; // screenshot counter for filename
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    await page.screenshot({
      path: `screenshot${i++}.png`,
      fullPage: true,
    });
  });

  cluster.queue("http://www.google.com/");
  cluster.queue("http://www.wikipedia.org/");

  await cluster.idle();
  await cluster.close();
  console.log("Program is finished\nBye!");
})();
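
For the original question, the task function could perform the same '.price' check that the question's code does instead of taking a screenshot. The following is a minimal sketch under that assumption. makePriceTask is a hypothetical helper name, and the selector and timeout are the values from the question. It returns a result object rather than throwing, so a missing selector is reported to the caller instead of being left as an unhandled rejection.

```javascript
// Hypothetical helper: returns a task function in the shape puppeteer-cluster
// passes to cluster.task(), i.e. async ({ page, data }) => result.
function makePriceTask(selector = '.price', timeout = 30000) {
    return async ({ page, data: url }) => {
        await page.goto(url, { waitUntil: 'load', timeout });
        try {
            // resolves once the selector appears, rejects on timeout
            await page.waitForSelector(selector, { timeout });
            return { url, hasPrice: true };
        } catch (err) {
            // selector never appeared: probably not the expected page
            return { url, hasPrice: false };
        }
    };
}

// Assumed usage with a cluster like the one above:
//   await cluster.task(makePriceTask());
//   const result = await cluster.execute(url);
```

Navigation errors from page.goto() are deliberately not caught here, so they would still surface as a failed task.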

Answered 2022-01-07T20:45:15.130