
I am new to Scrapy and Python, and I am using Scrapy 0.17.0. I have set up a crawler for a site that serves me a CAPTCHA page after a number of requests. I have configured 10 concurrent requests. Now, when I get the CAPTCHA page, I want to hold back any further requests until I have downloaded the CAPTCHA image and solved it.

Once the CAPTCHA is solved, I want to resume my request queue, but I don't know how to pause the queue. I added a sleep when I get a 302 status (which is the CAPTCHA page), but that doesn't work.

Below is my settings.py:

    BOT_NAME = 'testBot'
    SPIDER_MODULES = ['testCrawler.spiders']
    NEWSPIDER_MODULE = 'testCrawler.spiders'

    CONCURRENT_REQUESTS_PER_DOMAIN = 10
    CONCURRENT_SPIDERS = 5

    DOWNLOAD_DELAY = 5
    COOKIES_ENABLED = False  # must be a boolean; the string 'false' is truthy

    # SET USER AGENTS LIST
    USER_AGENTS = ['Mozilla/4.0  (compatible; MSIE 6.0; Windows NT 5.1; SV1; BTRS106490)',
                'Mozilla/4.0  (compatible; MSIE 7.0; Windows NT 6.2; .NET4.0E; .NET4.0C)',
                'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)',
                'Mozilla/5.0 (X11; Linux i686; rv:8.0) Gecko/20100101 Firefox/8.0']

    PROXIES = ['http://192.168.100.225:8123']

    DOWNLOADDELAYLIST = ['3', '4', '6', '5']

    RETRY_TIMES = 20
    RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 408, 302]
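
Note that USER_AGENTS, PROXIES and DOWNLOADDELAYLIST are not built-in Scrapy settings; they only take effect if a custom downloader middleware reads them. A minimal sketch of what such a middleware could look like (the class name and wiring here are assumptions, not my actual code):

    import random

    class RandomUserAgentProxyMiddleware(object):
        # Hypothetical downloader middleware that consumes the custom
        # USER_AGENTS / PROXIES settings above.
        def __init__(self, user_agents, proxies):
            self.user_agents = user_agents
            self.proxies = proxies

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler.settings.getlist('USER_AGENTS'),
                       crawler.settings.getlist('PROXIES'))

        def process_request(self, request, spider):
            # rotate identity on every outgoing request
            request.headers['User-Agent'] = random.choice(self.user_agents)
            if self.proxies:
                request.meta['proxy'] = random.choice(self.proxies)

A middleware like this would then be enabled through the DOWNLOADER_MIDDLEWARES setting.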

And here is my spider:

    import time
    import re
    from scrapy.http import Request
    from scrapy.selector import HtmlXPathSelector
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from testCrawler.items import linkItem
    from testCrawler.imageItems import linkImageItem

    class CategorySpider(CrawlSpider):
        name = 'categoryLink'
        allowed_domains = ['somedomail.com']
        start_urls = ['http://somesite.com/topsearches']

        def parse(self, response):
            self.state['items_count'] = self.state.get('items_count', 0) + 1
            self.logCaptchaPages(response.status, response.url)

            hxs = HtmlXPathSelector(response)
            catLinks = hxs.select('//div[@class="topsearcheschars"]/a/@href').extract()

            for catLink in catLinks:
                # skip links ending in /<number>
                if not re.match('(.*?)/[0-9]+$', catLink):
                    yield Request(catLink, callback=self.alphaDetailPage)

        def alphaDetailPage(self, aResponse):
            self.logCaptchaPages(aResponse.status, aResponse.url)
            hxs = HtmlXPathSelector(aResponse)
            pageLinks = hxs.select('//div[@class="topsearcheschars"]/a/@href').extract()
            dtlLinks = hxs.select('//div[@class="topsearches"]/a/@href').extract()

            for dtlLink in dtlLinks:
                yield Request(dtlLink, callback=self.listPageLinks)

            for pageLink in pageLinks:
                if re.match('(.*?)/[0-9]+$', pageLink):
                    yield Request(pageLink, callback=self.pageDetail)

        def pageDetail(self, bResponse):
            self.logCaptchaPages(bResponse.status, bResponse.url)
            hxs = HtmlXPathSelector(bResponse)
            dtlLinks = hxs.select('//div[@class="topsearches"]/a/@href').extract()

            for dtlLink in dtlLinks:
                yield Request(dtlLink, callback=self.listPageLinks)

        def listPageLinks(self, lResponse):
            self.logCaptchaPages(lResponse.status, lResponse.url)
            hxs = HtmlXPathSelector(lResponse)
            similarSearchLinks = hxs.select('//a[@class="similar_search"]/@href').extract()

            for similarLink in similarSearchLinks:
                yield Request(similarLink, callback=self.listPageLinks)

            itm = linkItem()
            titleList = hxs.select('//div[@id="h1-wrapper"]/h1/text()').extract()

            # no else branch: a bare "yield" would emit None, which Scrapy rejects
            if len(titleList) > 0:
                itm['url'] = lResponse.url
                itm['title'] = titleList[0]
                yield itm

        def logCaptchaPages(self, statusCode, urlToLog):
            if statusCode == 302:
                yield Request(urlToLog, callback=self.downloadImage)
                time.sleep(10)

        def downloadImage(self, iResponse):
            hxs = HtmlXPathSelector(iResponse)
            imageUrl = hxs.select('//body/img/@src').extract()[0]
            itm = linkImageItem()
            itm['url'] = iResponse.url
            itm['image_urls'] = [imageUrl]
            yield itm

For now I am only testing the CAPTCHA image download; once that works, I plan to call another function that sends a request back to the CAPTCHA page with the solved CAPTCHA text. Once that CAPTCHA page passes, I want to go on processing the next requests.
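
To illustrate, the follow-up function I have in mind would look roughly like this (the 'captcha' form field name and the solved text are made up; I still have to check what the site actually expects):

    from scrapy.http import FormRequest

    def submitCaptcha(self, iResponse, captchaText):
        # Sketch only: POST the solved CAPTCHA text back to the page.
        # The field name 'captcha' is an assumption, not the site's real form.
        return FormRequest.from_response(iResponse,
                                         formdata={'captcha': captchaText},
                                         callback=self.parse)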

Any ideas why this is not working?

I may simply be going about this the wrong way; can anyone point out where it goes wrong?

Any help is greatly appreciated. Thanks :)


1 Answer


You could try swapping time.sleep(10) and yield Request(urlToLog, callback=self.downloadImage) in your logCaptchaPages method, so that the request is only yielded after the 10-second pause:

    def logCaptchaPages(self, statusCode, urlToLog):
        if statusCode == 302:
            print "Got CAPTCHA page"
            time.sleep(10)
            yield Request(urlToLog, callback=self.downloadImage)
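
Bear in mind that time.sleep() blocks Scrapy's Twisted reactor, so every in-flight request stalls for those 10 seconds, not just this one; here that is arguably what you want, but it also delays the CAPTCHA download itself. If you need an explicit global pause instead, the engine object has pause() and unpause() methods. A rough sketch, assuming your spider has a self.crawler reference and some solve_captcha() helper (both assumptions, not working code):

    def handleCaptcha(self, response):
        # Stop the engine from scheduling new downloads while we work.
        self.crawler.engine.pause()
        try:
            # solve_captcha() is a hypothetical placeholder for your solver;
            # note that a blocking call here still ties up the reactor.
            captcha_text = solve_captcha(response)
        finally:
            # Resume the crawl whether or not solving succeeded.
            self.crawler.engine.unpause()
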
answered 2013-05-17T08:13:01.917