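I am using this example to change my identity through Tor/Privoxy, but I have run into several problems: I have to type 'scrapy crawl something.py' several times before the spider starts, and sometimes the spider stops abruptly in the middle of a crawl without any error message.
something.py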
class IT(CrawlSpider):
    name = 'IT'
    allowed_domains = ["www.jobstreet.com.sg"]
    start_urls = [
        'https://jobscentral.com.sg/jobs-it',
    ]
    custom_settings = {
        'TOR_RENEW_IDENTITY_ENABLED': True,
        'TOR_ITEMS_TO_SCRAPE_PER_IDENTITY': 20
    }
    download_delay = 4
    handle_httpstatus_list = [301, 302]
    rules = (
        #Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//li[@class="page-item"]/a[@aria-label="Next"]',)), callback="self.parse", follow=True),
        Rule(LinkExtractor(allow_domains=("jobscentral.com.sg", ), restrict_xpaths=('//li[@class="page-item"]/a[@aria-label="Next"]',)), callback='self.parse', follow=True),
    )

    def parse(self, response):
        items = []
        self.logger.info("Visited Outer Link %s", response.url)
        for sel in response.xpath('//h4'):
            item = JobsItems()
        next_page = response.xpath('//li[@class="page-item"]/a[@aria-label="Next"]/@href').extract_first()
        if next_page:
            base_url = get_base_url(response)
            absolute_next_page = urljoin(base_url, next_page)
            yield scrapy.Request(absolute_next_page, self.parse, dont_filter=True)

    def parse_jobdetails(self, response):
        self.logger.info('Visited Internal Link %s', response.url)
        print response
        item = response.meta['item']
        item = self.getJobInformation(item, response)
        return item

    def getJobInformation(self, item, response):
        trans_table = {ord(c): None for c in u'\r\n\t\u00a0'}
        item['jobnature'] = response.xpath('//job-snapshot/dl/div[1]/dd//text()').extract_first()
        return item
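For reference, the pagination step in parse can be checked outside Scrapy. The snippet below is a minimal sketch in Python 3 using only the standard library. It resolves a "Next" href against the page URL with urljoin, and it compares the resulting host with the two domain lists that appear in the spider (allowed_domains names www.jobstreet.com.sg, while start_urls and the Rule point at jobscentral.com.sg). The href value '/jobs-it?page=2' and both helper functions are hypothetical, and host_is_allowed is a simplified stand-in for a domain check, not Scrapy's own code.

```python
from urllib.parse import urljoin, urlparse


def resolve_next_page(base_url, href):
    """Return an absolute URL for a pagination href, or None if no href was found."""
    if not href:
        return None
    return urljoin(base_url, href)


def host_is_allowed(url, allowed_domains):
    """Simplified check: the host equals an allowed domain or is a subdomain of it."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# Hypothetical href; the real value comes from the "Next" link on the page.
next_url = resolve_next_page("https://jobscentral.com.sg/jobs-it", "/jobs-it?page=2")
print(next_url)

# The domain from allowed_domains versus the domain actually being crawled.
print(host_is_allowed(next_url, ["www.jobstreet.com.sg"]))
print(host_is_allowed(next_url, ["jobscentral.com.sg"]))
```

If resolve_next_page returns None for the HTML that was actually downloaded, parse yields nothing and the crawl ends after the first page, so it may be worth confirming that the "Next" XPath matches the response.body received through the proxy.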
Log output when the crawl fails to start:
2017-09-12 16:55:09 [scrapy.middleware] INFO: Enabled item pipelines:
['jobscentral.pipelines.JobscentralPipeline']
2017-09-12 16:55:09 [scrapy.core.engine] INFO: Spider opened
2017-09-12 16:55:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-09-12 16:55:09 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-09-12 16:55:11 [scrapy.extensions.throttle] INFO: slot: jobscentral.com.sg | conc: 1 | delay: 4000 ms (-1000) | latency: 1993 ms | size: 67510 bytes
2017-09-12 16:55:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://jobscentral.com.sg/jobs-it> (referer: None)
2017-09-12 16:55:11 [IT] INFO: got response 200 for 'https://jobscentral.com.sg/jobs-it'
2017-09-12 16:55:11 [IT] INFO: Visited Outer Link https://jobscentral.com.sg/jobs-it
2017-09-12 16:55:11 [scrapy.core.engine] INFO: Closing spider (finished)
2017-09-12 16:55:11 [IT] DEBUG: Closing connection pool...
Edit: error log
<<<HUGE CHUNK OF HTML>> from response.body here
---------------------------------------------------------
2017-09-12 17:39:01 [IT] INFO: Visited Outer Link https://jobscentral.com.sg/jobs-it
2017-09-12 17:39:01 [scrapy.core.engine] INFO: Closing spider (finished)
2017-09-12 17:39:01 [IT] DEBUG: Closing connection pool...
2017-09-12 17:39:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 290,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 68352,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 9, 12, 9, 39, 1, 683612),
'log_count/DEBUG': 4,
'log_count/INFO': 12,
'memusage/max': 58212352,
'memusage/startup': 58212352,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 9, 12, 9, 38, 58, 660671)}
2017-09-12 17:39:01 [scrapy.core.engine] INFO: Spider closed (finished)