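I'm not very familiar with advanced JavaScript and am looking for some guidance. I want to use puppeteer-cluster to store the content of web pages in a database. Here is a starting example: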

const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
  });

  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    const screen = await page.content();
    // Store content, do something else
  });

  cluster.queue('http://www.google.com/');
  cluster.queue('http://www.wikipedia.org/');
  // many more pages

  await cluster.idle();
  await cluster.close();
})();

It looks like I may have to use the pg package to connect to the database. What is the recommended approach?

This is my table:

+----+-----------------------------------------------------+---------+
| id | url                                                 | content |
+----+-----------------------------------------------------+---------+
| 1  | https://www.npmjs.com/package/pg                    |         |
+----+-----------------------------------------------------+---------+
| 2  | https://github.com/thomasdondorf/puppeteer-cluster/ |         |
+----+-----------------------------------------------------+---------+

I believe I have to pull the data (id and url) into an array and, each time content is received, store it in the database (by id and content).

1 Answer

You should create the database connection outside of the task function:

const { Client } = require('pg');
const client = new Client(/* ... */);
await client.connect();

Then query the data and queue it (including the ID so that you can save the result to the database later):

// query() resolves to a result object; the records are in its `rows` array
const { rows } = await client.query('SELECT id, url FROM your_table WHERE ...');
rows.forEach(row => cluster.queue({ id: row.id, url: row.url }));
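
As a rough sketch of the queuing step: in pg, `query()` resolves to a result object, and the selected records sit in its `rows` array, so that is what gets iterated. The helper name `queueRows` is hypothetical, and the fake cluster and fake result below are stand-ins for a real puppeteer-cluster instance and a real query result, used only so the snippet runs without a browser or database.

```javascript
// Hypothetical helper: queue one job per table row.
// `cluster` is anything with a queue(data) method (for example a puppeteer-cluster instance).
// `result` is the object that pg's query() resolves to.
function queueRows(cluster, result) {
  for (const { id, url } of result.rows) {
    // Pass the id along with the url so the task can update the right row later
    cluster.queue({ id, url });
  }
  return result.rows.length;
}

// Stand-ins (assumptions for illustration only)
const queuedJobs = [];
const fakeCluster = { queue: (job) => queuedJobs.push(job) };
const fakeResult = {
  rows: [
    { id: 1, url: 'https://www.npmjs.com/package/pg' },
    { id: 2, url: 'https://github.com/thomasdondorf/puppeteer-cluster/' },
  ],
};

const queuedCount = queueRows(fakeCluster, fakeResult);
```

Each queued job then arrives in the task function as `data`, which is why the task can destructure `{ id, url }` from it.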

Then, at the end of the task function, update the table row:

await cluster.task(async ({ page, data: { id, url } }) => {
    // ... run puppeteer and save results in content variable
    await client.query('UPDATE your_table SET content=$1 WHERE id=$2', [content, id]);
});
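
To make the update step easier to see in isolation, here is a minimal sketch. The helper name `saveContent` is hypothetical, and the fake client only records calls instead of talking to a database; a connected pg client would be passed in its place. The point it illustrates is that the HTML and the id travel in the values array and are bound to `$1` and `$2`, rather than being concatenated into the SQL text.

```javascript
// Hypothetical helper: write the fetched HTML back to the row it came from.
// `client` is anything with a pg-style query(text, values) method.
function saveContent(client, id, content) {
  // $1 and $2 are filled from the values array, in order
  return client.query('UPDATE your_table SET content=$1 WHERE id=$2', [content, id]);
}

// Stand-in client (assumption for illustration only): records each call
const recordedQueries = [];
const fakeClient = {
  query: (text, values) => {
    recordedQueries.push({ text, values });
    return Promise.resolve({ rowCount: 1 });
  },
};

saveContent(fakeClient, 2, '<html></html>');
```

Inside the real task function this would be called as `await saveContent(client, id, content)` after `page.content()` resolves.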

Overall, your code should look like this (note that I have not tested the code myself):

const { Cluster } = require('puppeteer-cluster');
const { Client } = require('pg');

(async () => {
    const client = new Client(/* ... */);
    await client.connect();

    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_CONTEXT,
        maxConcurrency: 2,
    });

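    await cluster.task(async ({ page, data: { id, url } }) => {
        await page.goto(url);
        const content = await page.content();
        // Save the page HTML to the row this job was queued from
        await client.query('UPDATE your_table SET content=$1 WHERE id=$2', [content, id]);
    });

    // query() resolves to a result object; the records are in its `rows` array
    const { rows } = await client.query('SELECT id, url FROM your_table');
    rows.forEach(row => cluster.queue({ id: row.id, url: row.url }));

    // Wait for all queued jobs to finish, then shut everything down
    await cluster.idle();
    await cluster.close();
    await client.end();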
})();
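
One thing the code above does not cover is what happens when a page fails to load or the update fails. puppeteer-cluster emits a `taskerror` event with the error and the job's data, so a possible approach is to collect failed jobs for later inspection. The following is only a sketch: `collectFailures` is a hypothetical helper, and a plain `EventEmitter` stands in for the cluster so the snippet runs without a browser.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper: remember which jobs failed and why.
// `cluster` is anything that emits 'taskerror' with (err, data).
function collectFailures(cluster) {
  const failures = [];
  cluster.on('taskerror', (err, data) => {
    failures.push({ id: data.id, url: data.url, message: err.message });
  });
  return failures;
}

// Stand-in (assumption for illustration only): an EventEmitter plays the cluster
const standInCluster = new EventEmitter();
const failures = collectFailures(standInCluster);

// Simulate one failed job
standInCluster.emit(
  'taskerror',
  new Error('Navigation timeout'),
  { id: 2, url: 'https://github.com/thomasdondorf/puppeteer-cluster/' }
);
```

With a real cluster, the listener would be registered right after `Cluster.launch()` and the collected list checked after `cluster.idle()`.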
answered 2019-03-20T18:54:21.213