
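I'm new to Scrapy, and so far I've managed to write a few spiders. I want to write a spider that crawls Yellow Pages and looks for listed websites that return a 404 response. The spider itself works, but pagination does not. Any help would be much appreciated. Thanks in advance.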

# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    #allowed_domains = ['www.yellowpages.com']
    start_urls = ['https://www.yellowpages.com/search?search_terms=handyman&geo_location_terms=Miami%2C+FL']

    def parse(self, response):
        for listing in response.css('div.search-results.organic div.srp-listing'):

            url = listing.css('a.track-visit-website::attr(href)').extract_first()

            yield scrapy.Request(url=url, callback=self.parse_details)

        # follow pagination links

        next_page_url = response.css('a.next.ajax-page::attr(href)').extract_first()
        next_page_url = response.urljoin(next_page_url)
        if next_page_url:
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def parse_details(self, response):
        yield {'Response': response}

1 Answer


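I ran your code and found a problem. In the first loop you don't check the value of url, and it is sometimes None (not every listing has a website link). Passing None to scrapy.Request raises an exception, which aborts parse before it ever reaches the pagination code. That is why it looks as if pagination isn't working.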

Here is a working version:

# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    #allowed_domains = ['www.yellowpages.com']
    start_urls = ['https://www.yellowpages.com/search?search_terms=handyman&geo_location_terms=Miami%2C+FL']

    def parse(self, response):
        for listing in response.css('div.search-results.organic div.srp-listing'):
            url = listing.css('a.track-visit-website::attr(href)').extract_first()
            # Some listings have no website link; skip them instead of
            # passing None to scrapy.Request.
            if url:
                yield scrapy.Request(url=url, callback=self.parse_details)

        # Follow the pagination link. Check the extracted href before joining:
        # urljoin(None) returns the current page URL, so a check made after
        # the join would always pass.
        next_page_url = response.css('a.next.ajax-page::attr(href)').extract_first()
        if next_page_url:
            yield scrapy.Request(url=response.urljoin(next_page_url), callback=self.parse)

    def parse_details(self, response):
        yield {'Response': response}
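
One detail about the pagination guard that is easy to miss: as far as I can tell, response.urljoin delegates to the standard library's urllib.parse.urljoin, which returns the base URL unchanged when the second argument is None. So the test for "is there a next page?" has to be made on the extracted href, before joining. Below is a minimal stdlib-only sketch of that behaviour. The helper name resolve_next_page and the relative link '/search?page=2' are made up for illustration and are not taken from the site.

```python
from urllib.parse import urljoin


def resolve_next_page(page_url, href):
    """Return the absolute next-page URL, or None when there is no link.

    Hypothetical helper, for illustration only.
    """
    if not href:
        # No "next" link was extracted: stop paginating.
        return None
    return urljoin(page_url, href)


page_url = 'https://www.yellowpages.com/search?search_terms=handyman'

# Joining first hides the missing link: urljoin hands back the base URL.
print(urljoin(page_url, None))

# Checking first keeps the "no next page" case visible.
print(resolve_next_page(page_url, None))

# Assumed relative link, resolved against the current page.
print(resolve_next_page(page_url, '/search?page=2'))
```

In a spider, joining first would schedule a request for the page that was just crawled. Scrapy's duplicate filter would normally drop it, so the crawl still ends, but checking the href first makes the intent explicit.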
answered 2017-07-02T10:47:11.497