
I am using this example here to change my identity with Tor/Privoxy, but I have run into a few problems, such as having to type "scrapy crawl something.py" several times to get the spider started, or the spider stopping abruptly in the middle of a crawl without any error message.
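For reference, renewing the Tor identity is normally done by sending a NEWNYM signal over the Tor control port while requests are proxied through Privoxy. A minimal sketch with the stem library (the control port 9051 and the Privoxy address 127.0.0.1:8118 are the usual defaults and are assumptions here, not values confirmed by my middleware settings):

from stem import Signal
from stem.control import Controller

def renew_tor_identity():
    # Connect to the Tor control port (9051 is the default ControlPort; assumption).
    with Controller.from_port(port=9051) as controller:
        controller.authenticate()         # uses the password/cookie configured in torrc
        controller.signal(Signal.NEWNYM)  # ask Tor for a new circuit, i.e. a new exit IP

# Requests then go out through Privoxy, which forwards to Tor, e.g.:
#   yield scrapy.Request(url, meta={'proxy': 'http://127.0.0.1:8118'})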

something.py

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.utils.response import get_base_url
from urllib.parse import urljoin  # Python 3; on Python 2: from urlparse import urljoin

from jobscentral.items import JobsItems  # project items module (assumed path)


class IT(CrawlSpider):
    name = 'IT'

    allowed_domains = ["www.jobstreet.com.sg"]
    start_urls = [
        'https://jobscentral.com.sg/jobs-it',
    ]

    custom_settings = {
        'TOR_RENEW_IDENTITY_ENABLED': True,
        'TOR_ITEMS_TO_SCRAPE_PER_IDENTITY': 20,
    }

    download_delay = 4
    handle_httpstatus_list = [301, 302]

    rules = (
        #Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//li[@class="page-item"]/a[@aria-label="Next"]',)), callback="self.parse", follow=True),
        Rule(LinkExtractor(allow_domains=("jobscentral.com.sg", ), restrict_xpaths=('//li[@class="page-item"]/a[@aria-label="Next"]',)), callback='self.parse', follow=True),
    )

    def parse(self, response):
        items = []

        self.logger.info("Visited Outer Link %s", response.url)

        for sel in response.xpath('//h4'):
            item = JobsItems()

        next_page = response.xpath('//li[@class="page-item"]/a[@aria-label="Next"]/@href').extract_first()

        if next_page:
            base_url = get_base_url(response)
            absolute_next_page = urljoin(base_url, next_page)
            yield scrapy.Request(absolute_next_page, self.parse, dont_filter=True)

    def parse_jobdetails(self, response):

        self.logger.info('Visited Internal Link %s', response.url)
        print(response)
        item = response.meta['item']
        item = self.getJobInformation(item, response)
        return item

    def getJobInformation(self, item, response):
        trans_table = {ord(c): None for c in u'\r\n\t\u00a0'}

        item['jobnature'] = response.xpath('//job-snapshot/dl/div[1]/dd//text()').extract_first()
        return item
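
As a side note on the first problem: scrapy crawl takes the spider's name attribute, so this spider would normally be started with scrapy crawl IT rather than the filename. If retyping the command is the issue, the spider can also be launched from a small script; a sketch assuming a standard Scrapy project layout (the module path jobscentral.spiders.something is hypothetical):

# run.py - start the spider programmatically instead of via the command line
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from jobscentral.spiders.something import IT  # hypothetical module path

process = CrawlerProcess(get_project_settings())  # picks up the project settings
process.crawl(IT)
process.start()  # blocks until the crawl finishes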

Log output when the crawl fails to start:

2017-09-12 16:55:09 [scrapy.middleware] INFO: Enabled item pipelines:
['jobscentral.pipelines.JobscentralPipeline']
2017-09-12 16:55:09 [scrapy.core.engine] INFO: Spider opened
2017-09-12 16:55:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-09-12 16:55:09 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-09-12 16:55:11 [scrapy.extensions.throttle] INFO: slot: jobscentral.com.sg | conc: 1 | delay: 4000 ms (-1000) | latency: 1993 ms | size: 67510 bytes
2017-09-12 16:55:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://jobscentral.com.sg/jobs-it> (referer: None)
2017-09-12 16:55:11 [IT] INFO: got response 200 for 'https://jobscentral.com.sg/jobs-it'
2017-09-12 16:55:11 [IT] INFO: Visited Outer Link https://jobscentral.com.sg/jobs-it
2017-09-12 16:55:11 [scrapy.core.engine] INFO: Closing spider (finished)
2017-09-12 16:55:11 [IT] DEBUG: Closing connection pool...

Edit: error log

<<<HUGE CHUNK OF HTML>>> from response.body here
---------------------------------------------------------
2017-09-12 17:39:01 [IT] INFO: Visited Outer Link https://jobscentral.com.sg/jobs-it
2017-09-12 17:39:01 [scrapy.core.engine] INFO: Closing spider (finished)
2017-09-12 17:39:01 [IT] DEBUG: Closing connection pool...
2017-09-12 17:39:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 290,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 68352,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 9, 12, 9, 39, 1, 683612),
 'log_count/DEBUG': 4,
 'log_count/INFO': 12,
 'memusage/max': 58212352,
 'memusage/startup': 58212352,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 9, 12, 9, 38, 58, 660671)}
2017-09-12 17:39:01 [scrapy.core.engine] INFO: Spider closed (finished)