
How do I create a crawler for monster.com that crawls all of the pages? For the "next page" link, monster.com calls a JavaScript function, and Scrapy cannot execute JavaScript, as you can see in the picture.

Here is my code; the pagination part does not work:

import scrapy
class MonsterComSpider(scrapy.Spider):
    name = 'monster.com'
    allowed_domains = ['www.monsterindia.com']
    start_urls = ['http://www.monsterindia.com/data-analyst-jobs.html/']

    def parse(self, response):
        urls = response.css('h2.seotitle > a::attr(href)').extract()

        for url in urls:
            yield scrapy.Request(url =url, callback = self.parse_details)

    #crawling all the pages

        next_page_url = response.css('ul.pager > li > a::attr(althref)').extract()
        if next_page_url:
           next_page_url = response.urljoin(next_page_url) 
           yield scrapy.Request(url = next_page_url, callback = self.parse)            


    def parse_details(self,response):
        yield {         
        'name' : response.css('h3 > a > span::text').extract()
        }

1 Answer


Your code raises an exception because next_page_url is a list and the response.urljoin method expects a string. The next page link extraction should look like this:

next_page_url = response.css('ul.pager > li > a::attr(althref)').extract_first()

(i.e. replace extract() with extract_first())
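
For context, here is a minimal sketch (using the selectors from the question) of how the corrected extraction slots into the parse method; extract_first() returns a single string, or None, which is what response.urljoin expects:

    # .extract() returns a list; response.urljoin() needs a single string.
    # .extract_first() returns the first match as a string, or None if nothing matched.
    next_page_url = response.css('ul.pager > li > a::attr(althref)').extract_first()
    if next_page_url:
        yield scrapy.Request(url=response.urljoin(next_page_url), callback=self.parse)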

EDIT:

There is another problem with the next_page_url extraction. All the logic is correct and the pagination works, but the next page link is only correct on the first page: it takes the first a element, and from the second page on there is also a previous page link. Change the next page URL extraction to:

next_page_url = response.css('ul.pager').xpath('//a[contains(text(), "Next")]/@althref').extract_first()

Now it paginates through all the pages correctly.
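
Putting both changes together, the whole spider would look roughly like the sketch below. All selectors come from the question and the fixes above; treat it as untested against the live site:

import scrapy

class MonsterComSpider(scrapy.Spider):
    name = 'monster.com'
    allowed_domains = ['www.monsterindia.com']
    start_urls = ['http://www.monsterindia.com/data-analyst-jobs.html/']

    def parse(self, response):
        # Follow each job posting link to its detail page.
        for url in response.css('h2.seotitle > a::attr(href)').extract():
            yield scrapy.Request(url=url, callback=self.parse_details)

        # Pick the "Next" link explicitly so the "Previous" link that appears
        # from page 2 onwards is not matched by mistake.
        next_page_url = response.css('ul.pager').xpath(
            '//a[contains(text(), "Next")]/@althref').extract_first()
        if next_page_url:
            yield scrapy.Request(url=response.urljoin(next_page_url),
                                 callback=self.parse)

    def parse_details(self, response):
        # Job title text on the detail page.
        yield {
            'name': response.css('h3 > a > span::text').extract()
        }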

Answered 2017-08-15T19:02:52