
I'm brand new to Puppeteer. I started yesterday, and I'm trying to make a program that flips through a URL, incrementing a player ID one at a time, and saves each player's stats using neDB. There are thousands of links to go through, and I found that if I use a for loop my computer basically crashes, because 1,000 Chromium instances try to open all at once. Is there a better or correct way to do this? Any advice would be appreciated.

const puppeteer = require('puppeteer');
const Datastore = require('nedb');

const database = new Datastore('database.db');
database.loadDatabase();

async function scrapeProduct(url){
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);

  let attributes = [];

  //Getting player's name
  const [name] = await page.$x('//*[@id="ctl00_ctl00_ctl00_Main_Main_name"]');
  const txt = await name.getProperty('innerText');
  const playerName = await txt.jsonValue();
  attributes.push(playerName);

  //Getting all 12 individual stats of the player
  for(let i = 1; i < 13; i++){
    let vLink = '//*[@id="ctl00_ctl00_ctl00_Main_Main_SectionTabBox"]/div/div/div/div[1]/table/tbody/tr['+i+']/td[2]';
    const [e1] = await page.$x(vLink);
    const val = await e1.getProperty('innerText');
    const skillVal = await val.jsonValue();
    attributes.push(skillVal);
  }

  //creating a player object to store the data how i want (i know this is probably ugly code and could be done in a much better way)
  let player = {
    Name: attributes[0],
    Athleticism: attributes[1],
    Speed: attributes[2],
    Durability: attributes[3],
    Work_Ethic: attributes[4],  
    Stamina: attributes[5], 
    Strength: attributes[6],    
    Blocking: attributes[7],
    Tackling: attributes[8],    
    Hands: attributes[9],   
    Game_Instinct: attributes[10],
    Elusiveness: attributes[11],    
    Technique: attributes[12],
  };

  database.insert(player);
  await browser.close();
}

//For loop to loop through 1000 player links... Url.com is swapped in here because the actual url is ridiculously long and not important.
for(let i = 0; i <= 1000; i++){
  let link = 'https://url.com/?id='+i+'&section=Ratings';
  scrapeProduct(link);
  console.log("Player #" + i + " scraped");
}
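To see why this loop overwhelms the machine, here is a self-contained sketch (plain Node, no Puppeteer; fakeScrape is a hypothetical stand-in for scrapeProduct): calling an async function without await starts every task immediately, so all 50 simulated scrapes end up in flight at once.

```javascript
let running = 0;
let peak = 0;

async function fakeScrape(id) {
  running++;
  peak = Math.max(peak, running);
  await new Promise(resolve => setTimeout(resolve, 10)); // stand-in for the real page work
  running--;
}

async function main() {
  const tasks = [];
  for (let i = 0; i < 50; i++) {
    tasks.push(fakeScrape(i)); // no await: all 50 start before any finishes
  }
  await Promise.all(tasks);
  return peak;
}

const done = main();
done.then(p => console.log('peak concurrency:', p)); // prints 50
```

With real Chromium pages instead of 10 ms timers, that peak is what exhausts memory.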

2 Answers


If you think the speed problem is re-opening/closing the browser on each run, move browser to the global scope and initialize it to null. Then create an init function with the following:

let browser = null

async function init(){
  if(!browser)
    browser = await puppeteer.launch()
}

That allows a page to be passed into your scrapeProduct function: async function scrapeProduct(url) becomes async function scrapeProduct(url, page). Replace await browser.close() with await page.close(). Now your loop will look like this:

//For loop to loop through 1000 player links... Url.com is swapped in here because the actual url is ridiculously long and not important.
(async () => {
  await init();
  const jobs = [];
  for(let i = 0; i <= 1000; i++){
    let link = 'https://url.com/?id='+i+'&section=Ratings';
    let page = await browser.newPage()
    jobs.push(scrapeProduct(link,page))
    console.log("Player #" + i + " scraped");
  }
  await Promise.all(jobs) // wait for every page to finish before closing the browser
  await browser.close()
})();

If you want to limit how many pages the browser has open at the same time, you can create a function to do that:

async function getTotalPages(){
  const allPages = await browser.pages()
  return allPages.length
}
async function newPage(){
  const MAX_PAGES = 5
  await new Promise(resolve=>{
    // check once a second to check on pages open
    const interval = setInterval(async ()=>{
      let totalPages = await getTotalPages()
      if(totalPages< MAX_PAGES){
        clearInterval(interval)
        resolve()
      }
    },1000)
  })
  return await browser.newPage()
}

If you do that, in your loop you would replace let page = await browser.newPage() with let page = await newPage().
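Here is a self-contained simulation of that polling gate (no Puppeteer; openPages is a hypothetical stand-in for browser.pages().length), showing that the number of concurrent "pages" never exceeds MAX_PAGES:

```javascript
const MAX_PAGES = 5;
let openPages = 0;   // stands in for browser.pages().length
let peakPages = 0;

function acquirePage() {
  return new Promise(resolve => {
    const interval = setInterval(() => {   // poll, like the answer's setInterval
      if (openPages < MAX_PAGES) {
        clearInterval(interval);
        openPages++;                       // claim the slot before resolving
        peakPages = Math.max(peakPages, openPages);
        resolve();
      }
    }, 10);
  });
}

async function fakeScrape(id) {
  await acquirePage();
  await new Promise(resolve => setTimeout(resolve, 30)); // simulated page work
  openPages--;                                           // like page.close()
}

const done = Promise.all(Array.from({ length: 12 }, (_, i) => fakeScrape(i)));
done.then(() => console.log('peak open pages:', peakPages)); // never exceeds 5
```

One difference from the code above: the slot counter is incremented synchronously inside the interval callback, before any await, so two waiters can't both see the same free slot in the same tick.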

answered 2020-12-24T10:22:00.420

The simplest adjustment is to wait for each link to finish before starting the next:

(async () => {
  for(let i = 0; i <= 1000; i++){
    let link = 'https://url.com/?id='+i+'&section=Ratings';
    await scrapeProduct(link);
    console.log("Player #" + i + " scraped");
  }
})();
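The same sequential pattern, simulated without Puppeteer (fakeScrape is a hypothetical stand-in for scrapeProduct), shows that at most one task is ever in flight and the IDs complete in order:

```javascript
let inFlight = 0;
let maxInFlight = 0;
const order = [];

async function fakeScrape(id) {
  inFlight++;
  maxInFlight = Math.max(maxInFlight, inFlight);
  await new Promise(resolve => setTimeout(resolve, 5)); // stand-in for page work
  order.push(id);
  inFlight--;
}

const done = (async () => {
  for (let i = 0; i < 10; i++) {
    await fakeScrape(i); // wait before moving on to the next id
  }
})();

done.then(() => console.log('max in flight:', maxInFlight)); // max in flight: 1
```

This is the gentlest option on the machine, at the cost of total runtime.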

You could also allow only as many open at once as your computer can handle. That takes more resources, but lets the process finish sooner. Figure out the limit you want, then do something like:

let i = 0;
const getNextLink = () => {
  if (i > 1000) return;
  let link = 'https://url.com/?id='+i+'&section=Ratings';
  i++;
  return scrapeProduct(link)
    .then(getNextLink)
    .catch(console.error); // or swap in your own error handler
};
Promise.all(Array.from(
  { length: 4 }, // allow 4 to run concurrently
  getNextLink
))
  .then(() => {
    // all done
  });

The above allows 4 calls to scrapeProduct to be active at any one time; change the number as needed.
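The chaining trick can be simulated without Puppeteer (fakeScrape is a hypothetical stand-in for scrapeProduct): each of the 4 initial calls keeps pulling the next ID when its current one resolves, so concurrency never rises above 4.

```javascript
const LAST_ID = 20;   // small stand-in for the real 1000
let nextId = 0;
let inFlight = 0;
let maxInFlight = 0;
const processed = [];

async function fakeScrape(id) {
  inFlight++;
  maxInFlight = Math.max(maxInFlight, inFlight);
  await new Promise(resolve => setTimeout(resolve, Math.random() * 10)); // page work
  processed.push(id);
  inFlight--;
}

const getNextLink = () => {
  if (nextId > LAST_ID) return;              // no more work: this chain ends
  const id = nextId++;
  return fakeScrape(id).then(getNextLink);   // pull the next id when done
};

const done = Promise.all(Array.from({ length: 4 }, getNextLink));
done.then(() => console.log('processed:', processed.length, 'peak:', maxInFlight)); // 21 items, peak 4
```

Because each of the 4 chains only starts a new task after its previous one settles, the pool size stays constant no matter how long the ID list is.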

answered 2020-12-22T18:37:53.313