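I have the following class methods in a Scrapy spider. `parse_category` yields a `Request` object whose callback is `parse_product`. Sometimes a category page redirects to a product page, so in `parse_category` I check whether the response is actually a product page. If it is, I just call the `parse_product` method directly. But for some reason the method is not being called.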

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        anchors = hxs.select('//div[@id="panelMfr"]/div/ul/li[position() != last()]/a')
        for anchor in anchors[2:3]:
            url = anchor.select('@href').extract().pop()
            cat = anchor.select('text()').extract().pop().strip()
            yield Request(urljoin(get_base_url(response), url), callback=self.parse_category, meta={"category": cat})

    def parse_category(self, response):
        hxs = HtmlXPathSelector(response)
        base_url = get_base_url(response)

        # check if it's a redirected product page
        if (hxs.select(self.product_name_xpath)):
            self.log("Category-To-Product Redirection")
            self.parse_product(response)  # <<---- This line is not called.
            self.log("Product Parsed")
            return

        products_xpath = '//div[@class="productName"]/a/@href'

        products = hxs.select(products_xpath).extract()
        for url in products:
            yield Request(urljoin(base_url, url), callback=self.parse_product, meta={"category": response.meta['category']})

        next_page = hxs.select('//table[@class="nav-back"]/tr/td/span/a[contains(text(), "Next")]/text()').extract()

        if next_page:
            url = next_page[0]
            yield Request(urljoin(base_url, url), callback=self.parse_category, meta={"category": response.meta['category']})

    def parse_product(self, response):
        hxs = HtmlXPathSelector(response)
        base_url = get_base_url(response)

        self.log("Inside parse_product")

In the log I see `Category-To-Product Redirection` and `Product Parsed` printed, but `Inside parse_product` is missing. What am I doing wrong here?

2013-07-12 21:31:34+0100 [example.com] DEBUG: Crawled (200) <GET http://www.example.com/category.aspx> (referer: None)
2013-07-12 21:31:34+0100 [example.com] DEBUG: Redirecting (302) to <GET http://www.example.com/productinfo.aspx?catref=AM6901> from <GET http://www.example.com/products/Inks-Toners/Apple>
2013-07-12 21:31:35+0100 [example.com] DEBUG: Crawled (200) <GET http://www.example.com/productinfo.aspx?catref=AM6901> (referer: http://www.example.com/category.aspx)
2013-07-12 21:31:35+0100 [example.com] DEBUG: Category-To-Product Redirection
2013-07-12 21:31:35+0100 [example.com] DEBUG: Product Parsed
2013-07-12 21:31:35+0100 [example.com] INFO: Closing spider (finished)

2013-07-12 21:31:35+0100 [-] ERROR: ERROR:root:SPIDER CLOSED: No. of products: 0
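
One possible explanation, sketched below without Scrapy: if `parse_product` contains a `yield` somewhere past the excerpt shown (an assumption, since the method body above is truncated), then it is a generator function, and calling it only creates a generator object. Its body, including the `self.log("Inside parse_product")` line, would not run until something iterates over it. The function names below are hypothetical stand-ins for the spider methods, and the `ran` list is only there to record which bodies executed.

```python
ran = []  # records whether the body of parse_product actually executed

URL = "http://www.example.com/productinfo.aspx?catref=AM6901"


def parse_product(response):
    # Hypothetical stand-in for the spider method. The `yield` below
    # makes this a generator function (assumed, not shown in the question).
    ran.append("parse_product")
    yield {"url": response}


def parse_category_broken(response, is_product):
    # Mirrors the pattern in the question: the generator object is
    # created and immediately discarded, so parse_product's body never runs.
    if is_product:
        parse_product(response)
        return
    yield "request for a product listed on the category page"


def parse_category_fixed(response, is_product):
    # Iterating over the generator runs its body and passes its
    # results on to the caller.
    if is_product:
        for item in parse_product(response):
            yield item
        return
    yield "request for a product listed on the category page"


broken_items = list(parse_category_broken(URL, True))
ran_after_broken = list(ran)

del ran[:]
fixed_items = list(parse_category_fixed(URL, True))
ran_after_fixed = list(ran)

print(broken_items, ran_after_broken)
print(fixed_items, ran_after_fixed)
```

Under that assumption, the change in the spider would be to loop over `self.parse_product(response)` inside `parse_category` and yield each result, rather than calling it as a plain statement.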