I am currently working with Scrapy. Below is my spider.py code:
# Old-style Scrapy import paths that match BaseSpider / HtmlXPathSelector.
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class ExampleSpider(BaseSpider):
    name = "example"
    allowed_domains = ["careers-preftherapy.icims.com"]
    start_urls = [
        "https://careers-preftherapy.icims.com/jobs/search"
    ]

    def parse(self, response):
        # Read the total page count from the paging cell, then request every page.
        hxs = HtmlXPathSelector(response)
        pageCount = hxs.select('//td[@class = "iCIMS_JobsTablePaging"]/table/tr/td[2]/text()').extract()[0].strip()[-2:].strip()
        for i in range(1, int(pageCount) + 1):
            yield Request("https://careers-preftherapy.icims.com/jobs/search?pr=%d" % i, callback=self.parsePage)

    def parsePage(self, response):
        # Collect the job links from the odd and the even table rows.
        hxs = HtmlXPathSelector(response)
        urls_list_odd_id = hxs.select('//table[@class="iCIMS_JobsTable"]/tr/td[@class="iCIMS_JobsTableOdd iCIMS_JobsTableField_1"]/a/@href').extract()
        print urls_list_odd_id, ">>>>>>>odddddd>>>>>>>>>>>>>>>>"
        urls_list_even_id = hxs.select('//table[@class="iCIMS_JobsTable"]/tr/td[@class="iCIMS_JobsTableEven iCIMS_JobsTableField_1"]/a/@href').extract()
        # Print the even list here (the original printed the odd list twice).
        print urls_list_even_id, ">>>>>>>Evennnn>>>>>>>>>>>>>>>>"
        urls_list = []
        urls_list.extend(urls_list_odd_id)
        urls_list.extend(urls_list_even_id)
        for i in urls_list:
            yield Request(i.encode('utf-8'), callback=self.parseJob)

    def parseJob(self, response):
        pass
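As a side note, the `[-2:]` slice in `parse` only works while the page count has at most two digits. Here is a minimal sketch of a more tolerant version, assuming the paging cell text ends with the total page count (the helper name `extract_page_count` and the sample strings in the comments are hypothetical, not taken from the site):

```python
import re


def extract_page_count(paging_text):
    """Return the trailing integer of the paging text, or None if there is none."""
    # Assumption: the total page count is the last number in the cell text,
    # for example a hypothetical "Page 1 of 6".
    match = re.search(r"(\d+)\s*$", paging_text)
    return int(match.group(1)) if match else None
```

Returning None instead of raising makes it easier to notice a response whose paging cell is missing or empty.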
After opening the start page, I handle pagination by requesting these URLs:
https://careers-preftherapy.icims.com/jobs/search?pr=1
https://careers-preftherapy.icims.com/jobs/search?pr=2
... and so on.
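The URL pattern above can be sketched as a small helper. This is only an illustration of what the loop in `parse` builds; the names `SEARCH_URL` and `build_page_urls` are hypothetical:

```python
SEARCH_URL = "https://careers-preftherapy.icims.com/jobs/search"


def build_page_urls(page_count):
    """Return the search URL for every page from 1 to page_count inclusive."""
    return ["%s?pr=%d" % (SEARCH_URL, page) for page in range(1, page_count + 1)]
```

For example, `build_page_urls(2)` gives the two URLs listed above.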
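I yield a request for each of these URLs (suppose there are 6 pages). When Scrapy reaches the first URL (https://careers-preftherapy.icims.com/jobs/search?pr=1), I try to collect all the href values from it, and when it reaches the second URL I collect all the href values in the same way.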
As you can see in my code, each page has 20 href values in total: 10 of them are under td[@class="iCIMS_JobsTableOdd iCIMS_JobsTableField_1"] and the rest are under td[@class="iCIMS_JobsTableEven iCIMS_JobsTableField_1"].
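Since both cell types share the class token iCIMS_JobsTableField_1, the odd and even links could also be collected in one pass, which keeps them in page order. Below is a minimal sketch using only the standard library; the `SAMPLE` markup is a hypothetical, simplified stand-in for the real table, and `job_links` is an illustrative name:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the real jobs table markup.
SAMPLE = """
<table class="iCIMS_JobsTable">
  <tr><td class="iCIMS_JobsTableOdd iCIMS_JobsTableField_1"><a href="/jobs/1/job">A</a></td></tr>
  <tr><td class="iCIMS_JobsTableEven iCIMS_JobsTableField_1"><a href="/jobs/2/job">B</a></td></tr>
  <tr><td class="iCIMS_JobsTableOdd iCIMS_JobsTableField_2">Location</td></tr>
</table>
"""


def job_links(table_markup):
    """Collect hrefs from every first-field cell, odd or even, in page order."""
    root = ET.fromstring(table_markup.strip())
    links = []
    for cell in root.iter("td"):
        # Match on the shared class token instead of the full class string.
        classes = cell.get("class", "").split()
        if "iCIMS_JobsTableField_1" in classes:
            links.extend(a.get("href") for a in cell.findall("a") if a.get("href"))
    return links
```

In the spider itself, the analogous idea would presumably be a single XPath that tests `contains(@class, "iCIMS_JobsTableField_1")` rather than two exact class matches.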
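The problem is that the links are sometimes extracted and sometimes not, and I do not know what is happening. When I run the spider twice, one run extracts the links while the other returns empty lists, as shown below.
First run: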
2012-07-17 17:05:20+0530 [Preferredtherapy] DEBUG: Crawled (200) <GET https://careers-preftherapy.icims.com/jobs/search?pr=2> (referer: https://careers-preftherapy.icims.com/jobs/search)
[] >>>>>>>odddddd>>>>>>>>>>>>>>>>
[] >>>>>>>Evennnn>>>>>>>>>>>>>>>>
Second run:
2012-07-17 17:05:20+0530 [Preferredtherapy] DEBUG: Crawled (200) <GET https://careers-preftherapy.icims.com/jobs/search?pr=2> (referer: https://careers-preftherapy.icims.com/jobs/search)
[u'https://careers-preftherapy.icims.com/jobs/1836/job', u'https://careers-preftherapy.icims.com/jobs/1813/job', u'https://careers-preftherapy.icims.com/jobs/1763/job']>>>>>>>odddddd>>>>>>>>>>>>>>>>
[u'https://careers-preftherapy.icims.com/jobs/1811/job', u'https://careers-preftherapy.icims.com/jobs/1787/job']>>>>>>>Evennnn>>>>>>>>>>>>>>>>
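To compare what the server actually returned in the two runs, one option is to save the raw page whenever no links are found. This is only a debugging sketch, not a fix; `dump_if_empty` and the file name pattern are hypothetical, and in the spider it would be called with `response.body`:

```python
import os


def dump_if_empty(links, body, directory, page):
    """Write the raw response body to a file when no links were extracted.

    Returns the path of the written file, or None when links is non-empty.
    """
    if links:
        return None
    # Hypothetical file name pattern, one file per page number.
    path = os.path.join(directory, "empty_page_%d.html" % page)
    with open(path, "wb") as handle:
        handle.write(body)
    return path
```

Opening the saved file would show whether the 200 response for the empty run contains the jobs table at all.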
My question is: why are the links extracted on some runs but not on others? Any reply would be really helpful.
Thanks in advance.