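I am using puppeteer-cluster to crawl web pages.
If I open many pages at a time per website (8-10 pages), the connection slows down and many timeout errors come up, like this:
TimeoutError: Navigation Timeout Exceeded: 30000ms exceeded
I only need access to the HTML code of each page. I don't need to wait for domcontentloaded and so on.
Is there a way to tell page.goto() to wait only for the first response from the web server? Or do I need to use another technology instead of puppeteer?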
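domcontentloaded is the event for the initial HTML content.
The DOMContentLoaded event fires when the initial HTML document has been completely loaded and parsed, without waiting for stylesheets, images, and subframes to finish loading.
The following call resolves as soon as the initial HTML document has been loaded.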
await page.goto(url, {waitUntil: 'domcontentloaded'})
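However, if you load 10 pages at a time, you can block images or stylesheets to save bandwidth and speed up loading.
Put the code below in the right place (before navigating with page.goto) and it will stop images, stylesheets, fonts, and scripts from loading.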
await page.setRequestInterception(true);
page.on('request', (request) => {
  if (['image', 'stylesheet', 'font', 'script'].indexOf(request.resourceType()) !== -1) {
    request.abort();
  } else {
    request.continue();
  }
});
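When a cluster opens many pages, it can help to keep the blocking decision in one place so every page uses the same rule. The following is only a sketch of that idea: the names BLOCKED_TYPES, shouldBlock, and makeRequestFilter are hypothetical, and the handler relies on nothing beyond the request.resourceType(), request.abort(), and request.continue() calls already used above.

```javascript
// Resource types to skip; this list mirrors the one in the snippet above.
const BLOCKED_TYPES = new Set(['image', 'stylesheet', 'font', 'script']);

// Hypothetical helper: true when a request of this resource type should be aborted.
function shouldBlock(resourceType) {
  return BLOCKED_TYPES.has(resourceType);
}

// Hypothetical helper: builds a handler suitable for page.on('request', ...).
function makeRequestFilter() {
  return (request) => {
    if (shouldBlock(request.resourceType())) {
      request.abort();
    } else {
      request.continue();
    }
  };
}
```

Assuming request interception has already been enabled with page.setRequestInterception(true), it would be attached with page.on('request', makeRequestFilter()) before calling page.goto.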
@user3817605, I have the perfect code for you. :)
/**
* The methods `page.waitForNavigation` and `frame.waitForNavigation` wait for the page
* event `domcontentloaded` at minimum. This function returns a promise that resolves as
* soon as the specified page `event` happens.
*
* @param {puppeteer.Page} page
* @param {string} event Can be any event accepted by the method `page.on()`. E.g.: "requestfinished" or "framenavigated".
* @param {number} [timeout] optional time to wait. If not specified, waits forever.
*/
function waitForEvent(page, event, timeout) {
page.once(event, done);
let fulfill, timeoutId = (typeof timeout === 'number' && timeout >= 0) ? setTimeout(done, timeout) : -1;
return new Promise(resolve => fulfill = resolve);
function done() {
clearTimeout(timeoutId);
fulfill();
}
}
You asked for a function that waits only for the first response, so you use the function like this:
page.goto(<URL>); // add .catch(() => {}) if you close the page too soon, to avoid errors being thrown to the console
await waitForEvent(page, 'response'); // after this line you already have the html response received
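This is exactly what you asked for. But note that "response received" is not the same as "complete html response received". The first is the start of the response and the second is the end of it. So you may want to use the event "requestfinished" instead of "response". In fact, you can use any event accepted by the puppeteer Page. They are: close, console, dialog, domcontentloaded, error, frameattached, framedetached, framenavigated, load, metrics, pageerror, popup, request, requestfailed, requestfinished, response, workercreated, workerdestroyed.
Try these: requestfinished or framenavigated. They may be a good fit for you.
To help you decide which one works best for you, you can set up test code like this: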
const puppeteer = require('puppeteer');
/**
* The methods `page.waitForNavigation` and `frame.waitForNavigation` wait for the page
* event `domcontentloaded` at minimum. This function returns a promise that resolves as
* soon as the specified page `event` happens.
*
* @param {puppeteer.Page} page
* @param {string} event Can be any event accepted by the method `page.on()`. E.g.: "requestfinished" or "framenavigated".
* @param {number} [timeout] optional time to wait. If not specified, waits forever.
*/
function waitForEvent(page, event, timeout) {
page.once(event, done);
let fulfill, timeoutId = (typeof timeout === 'number' && timeout >= 0) ? setTimeout(done, timeout) : -1;
return new Promise(resolve => fulfill = resolve);
function done() {
clearTimeout(timeoutId);
fulfill();
}
}
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
const cdp = await page.target().createCDPSession();
await cdp.send('Network.enable');
await cdp.send('Page.enable');
const t0 = Date.now();
page.on('request', req => console.log(`> ${Date.now() - t0} request start: ${req.url()}`));
page.on('response', res => console.log(`< ${Date.now() - t0} response: ${res.url()}`));
page.on('requestfinished', req => console.log(`. ${Date.now() - t0} request finished: ${req.url()}`));
page.on('requestfailed', req => console.log(`E ${Date.now() - t0} request failed: ${req.url()}`));
page.goto('https://www.google.com').catch(() => { });
await waitForEvent(page, 'requestfinished');
console.log(`\nThe page was released after ${Date.now() - t0}ms\n`);
await page.close();
await browser.close();
})();
/* The output should be something like this:
> 2 request start: https://www.google.com/
< 355 response: https://www.google.com/
> 387 request start: https://www.google.com/tia/tia.png
> 387 request start: https://www.google.com/images/branding/googlelogo/1x/googlelogo_color_272x92dp.png
. 389 request finished: https://www.google.com/
The page was released after 389ms
*/
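The waitForEvent helper above resolves the same way whether the event fired or the timeout elapsed, and on timeout its once listener stays registered. A possible variant, sketched here under assumptions, reports which of the two happened and removes the listener when it times out. The name waitForEventOrTimeout is hypothetical, and the sketch is exercised only against a plain Node EventEmitter; whether removeListener behaves the same on a puppeteer Page depends on the puppeteer version and is not confirmed here.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical variant of waitForEvent: resolves to true if `event` fired,
// or to false if `timeout` milliseconds passed first.
// Without a numeric timeout it waits for the event indefinitely.
function waitForEventOrTimeout(emitter, event, timeout) {
  return new Promise((resolve) => {
    let timeoutId;
    const onEvent = () => {
      clearTimeout(timeoutId);
      resolve(true);
    };
    emitter.once(event, onEvent);
    if (typeof timeout === 'number' && timeout >= 0) {
      timeoutId = setTimeout(() => {
        emitter.removeListener(event, onEvent); // do not leave a stale listener behind
        resolve(false);
      }, timeout);
    }
  });
}

// Demo with a plain EventEmitter standing in for the page (an assumption).
const fakePage = new EventEmitter();
waitForEventOrTimeout(fakePage, 'response', 50).then((fired) => {
  console.log(fired ? 'event fired' : 'timed out'); // prints "event fired"
});
fakePage.emit('response');
```

Returning a boolean lets the caller decide whether to read the page or retry, instead of treating a timeout as success.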
I can see two other ways to achieve what you want: using page.waitForResponse and page.waitForFunction. Let's look at both.
With page.waitForResponse, you can do something as simple as this:
page.goto('https://www.google.com/').catch(() => {});
await page.waitForResponse('https://www.google.com/'); // don't forget to put the final slash
Simple, huh? If you don't like it, try page.waitForFunction and wait until the document is created:
page.goto('https://www.google.com/').catch(() => {});
await page.waitForFunction(() => document); // you can use `window` too. It is almost the same
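This code waits until document exists. That happens when the first bits of html arrive and the browser starts building the DOM tree representation of the document.
But note that although both solutions are simple, neither of them waits until the whole html page/document has been downloaded. If you need that, you should modify the waitForEvent function from my other answer so that it accepts the specific url you want to be fully downloaded. Example: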
/**
* The methods `page.waitForNavigation` and `frame.waitForNavigation` wait for the page
* event `domcontentloaded` at minimum. This function returns a promise that resolves as
* soon as the specified `requestUrl` resource has finished downloading, or `timeout` elapses.
*
* @param {puppeteer.Page} page
* @param {string} requestUrl pass the exact url of the resource you want to wait for. Paths must be ended with slash "/". Don't forget that.
* @param {number} [timeout] optional time to wait. If not specified, waits forever.
*/
function waitForRequestToFinish(page, requestUrl, timeout) {
page.on('requestfinished', onRequestFinished);
let fulfill, timeoutId = (typeof timeout === 'number' && timeout >= 0) ? setTimeout(done, timeout) : -1;
return new Promise(resolve => fulfill = resolve);
function done() {
page.removeListener('requestfinished', onRequestFinished);
clearTimeout(timeoutId);
fulfill();
}
function onRequestFinished(req) {
if (req.url() === requestUrl) done();
}
}
How to use it:
page.goto('https://www.amazon.com/').catch(() => {});
await waitForRequestToFinish(page, 'https://www.amazon.com/', 3000);
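Because waitForRequestToFinish compares URLs with ===, a missing trailing slash means the promise only settles through the timeout. One way to reduce that risk, sketched here as an assumption rather than as part of the original answer, is to pass the URL through the standard URL class first, which serializes a bare origin with the trailing slash. The helper name normalizeUrl is hypothetical.

```javascript
// Hypothetical helper: returns the URL as serialized by the WHATWG URL parser,
// which adds the trailing slash to a bare origin such as https://www.amazon.com.
function normalizeUrl(rawUrl) {
  return new URL(rawUrl).href;
}

console.log(normalizeUrl('https://www.amazon.com'));  // prints "https://www.amazon.com/"
console.log(normalizeUrl('https://www.amazon.com/')); // prints "https://www.amazon.com/"
```

The normalized string could then be passed as the requestUrl argument. This assumes req.url() reports the URL in the same serialized form, which matches the log output shown in this thread but is not guaranteed for every site, for example when the server redirects.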
A full example that prints tidy console logs:
const puppeteer = require('puppeteer');
/**
* The methods `page.waitForNavigation` and `frame.waitForNavigation` wait for the page
* event `domcontentloaded` at minimum. This function returns a promise that resolves as
* soon as the specified `requestUrl` resource has finished downloading, or `timeout` elapses.
*
* @param {puppeteer.Page} page
* @param {string} requestUrl pass the exact url of the resource you want to wait for. Paths must be ended with slash "/". Don't forget that.
* @param {number} [timeout] optional time to wait. If not specified, waits forever.
*/
function waitForRequestToFinish(page, requestUrl, timeout) {
page.on('requestfinished', onRequestFinished);
let fulfill, timeoutId = (typeof timeout === 'number' && timeout >= 0) ? setTimeout(done, timeout) : -1;
return new Promise(resolve => fulfill = resolve);
function done() {
page.removeListener('requestfinished', onRequestFinished);
clearTimeout(timeoutId);
fulfill();
}
function onRequestFinished(req) {
if (req.url() === requestUrl) done();
}
}
(async () => {
const netMap = new Map();
const browser = await puppeteer.launch();
const page = await browser.newPage();
const cdp = await page.target().createCDPSession();
await cdp.send('Network.enable');
await cdp.send('Page.enable');
const t0 = Date.now();
cdp.on('Network.requestWillBeSent', ({ requestId, request: { url: requestUrl } }) => {
netMap.set(requestId, requestUrl);
console.log(`> ${Date.now() - t0}ms\t requestWillBeSent:\t${requestUrl}`);
});
cdp.on('Network.responseReceived', ({ requestId }) => console.log(`< ${Date.now() - t0}ms\t responseReceived:\t${netMap.get(requestId)}`));
cdp.on('Network.dataReceived', ({ requestId, dataLength }) => console.log(`< ${Date.now() - t0}ms\t dataReceived:\t\t${netMap.get(requestId)} ${dataLength} bytes`));
cdp.on('Network.loadingFinished', ({ requestId }) => console.log(`. ${Date.now() - t0}ms\t loadingFinished:\t${netMap.get(requestId)}`));
cdp.on('Network.loadingFailed', ({ requestId }) => console.log(`E ${Date.now() - t0}ms\t loadingFailed:\t${netMap.get(requestId)}`));
// The magic happens here
page.goto('https://www.amazon.com').catch(() => { });
await waitForRequestToFinish(page, 'https://www.amazon.com/', 3000);
console.log(`\nThe page was released after ${Date.now() - t0}ms\n`);
await page.close();
await browser.close();
})();
/* OUTPUT EXAMPLE
[... lots of logs removed ...]
> 574ms requestWillBeSent: https://images-na.ssl-images-amazon.com/images/I/71vvXGmdKWL._AC_SY200_.jpg
< 574ms dataReceived: https://www.amazon.com/ 65536 bytes
< 624ms responseReceived: https://images-na.ssl-images-amazon.com/images/G/01/AmazonExports/Fuji/2019/February/Dashboard/computer120x._CB468850970_SY85_.jpg
> 628ms requestWillBeSent: https://images-na.ssl-images-amazon.com/images/I/81Hhc9zh37L._AC_SY200_.jpg
> 629ms requestWillBeSent: https://images-na.ssl-images-amazon.com/images/G/01/personalization/ybh/loading-4x-gray._CB317976265_.gif
< 631ms dataReceived: https://www.amazon.com/ 58150 bytes
. 631ms loadingFinished: https://www.amazon.com/
*/
This code shows many requests and responses, but it stops as soon as "https://www.amazon.com/" has been fully downloaded.