apify - Scrape every link from sitemap.xml

Question

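I am new to Apify.

I want to scrape every link listed in a sitemap.xml.

More specifically, here is my situation. My sitemap URL is: https://www.mywebsite.com/sitemap.xml

The links in my sitemap look like this: https://www.mywebsite.com/product_id/product

For example: https://www.mywebsite.com/534372/acer_laptop

Is there a solution that lets me extract the following elements from each link: title, product_image_url, price?

I tried Web Scraper and Legacy PhantomJS Crawler, but I think I am missing something, because I cannot get the elements I need.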

score 0 · Accepted Answer

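To improve performance, either:

- make sure these options are disabled in the Advanced settings:
    - Download media files
    - Download CSS files
- consider using Cheerio Scraper instead of Web Scraper or Puppeteer Scraper, if you have not already: https://docs.apify.com/scraping/cheerio-scraper
- request a custom optimized solution on the Apify Marketplace (MP): https://apify.com/marketplace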
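As a rough illustration of the Cheerio Scraper suggestion, the page function below returns the three fields asked about in the question. It is only a sketch: `extractProduct` is a hypothetical helper, and the selectors `h1`, `img.product-image`, and `.price` are placeholders that must be replaced with whatever the real product pages use.

```javascript
// Hypothetical helper: reads the three fields from a loaded page.
// Every CSS selector here is a placeholder (an assumption, not the site's real markup).
function extractProduct($, url) {
  return {
    url,
    title: $("h1").first().text().trim(),
    product_image_url: $("img.product-image").first().attr("src") || null,
    price: $(".price").first().text().trim(),
  };
}

// Page function in the shape Cheerio Scraper expects: it receives a context
// object whose `$` is the Cheerio handle for the downloaded HTML.
async function pageFunction(context) {
  const { $, request } = context;
  return extractProduct($, request.url);
}
```

Keeping the selectors in one small helper makes it easy to try them against a single product page before running the scraper over the whole sitemap.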

score 0 · Accepted Answer

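Consider writing a function with Puppeteer. Open the sitemap in your browser and look for the class name of the individual tags. This function might be a good start. I am going to try it myself and see whether it works.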

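  const puppeteer = require("puppeteer");

  // Open the sitemap in headless Chrome and collect the text of each matched node.
  async function scrap() {
      const browser = await puppeteer.launch({
        headless: true,
        args: ["--no-sandbox", "--disable-setuid-sandbox"],
      });

      const page = await browser.newPage();

      await page.goto(`https://yourpage.it/sitemap.xml`);

      const data = await page.evaluate(() => {
        // querySelectorAll returns a NodeList, which has no innerHTML of its own,
        // so read each element in turn. The selector is a guess at how the browser
        // renders the XML; inspect the page and adjust it if nothing matches.
        const links = Array.from(
          document.querySelectorAll(".html-tag > span"),
          (el) => el.textContent.trim()
        );

        return { links };
      });

      await page.close();
      await browser.close();
      return data;
  }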
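If reading the rendered XML through browser selectors turns out to be fragile, another option is to work on the raw sitemap text instead. The sketch below is dependency-free and assumes the sitemap uses plain `<loc>` entries (no CDATA sections). `extractSitemapUrls` and `isProductUrl` are hypothetical helper names, and the product pattern is inferred from the example URL in the question.

```javascript
// Hypothetical helper: pull every <loc> value out of raw sitemap XML text.
function extractSitemapUrls(xml) {
  const urls = [];
  const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/gi;
  for (const match of xml.matchAll(locPattern)) {
    // Sitemaps escape "&" as "&amp;", so undo that for the final URL.
    urls.push(match[1].replace(/&amp;/g, "&"));
  }
  return urls;
}

// Hypothetical helper: keep only URLs shaped like /<numeric id>/<slug>.
function isProductUrl(url) {
  return /^https?:\/\/[^/]+\/\d+\/[^/]+\/?$/.test(url);
}

// Inline sample so the sketch runs without any network access.
const sampleXml = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.mywebsite.com/534372/acer_laptop</loc></url>
  <url><loc>https://www.mywebsite.com/about</loc></url>
</urlset>`;

const productUrls = extractSitemapUrls(sampleXml).filter(isProductUrl);
console.log(productUrls);
```

The resulting list can then be used as the start URLs for whichever scraper extracts title, product_image_url, and price from each product page.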
