I'm currently studying Scrapy. Below is my spider.py code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request


class ExampleSpider(BaseSpider):
    name = "example"
    # allowed_domains should be a list of strings, not a set
    allowed_domains = ["careers-preftherapy.icims.com"]


    start_urls = [
        "https://careers-preftherapy.icims.com/jobs/search"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Take the trailing characters of the paging cell text (e.g. "1 of 6" -> "6") as the page count
        paging_text = hxs.select('//td[@class = "iCIMS_JobsTablePaging"]/table/tr/td[2]/text()').extract()[0]
        pageCount = paging_text.strip()[-2:].strip()
        for i in range(1,int(pageCount)+1):
            yield Request("https://careers-preftherapy.icims.com/jobs/search?pr=%d"%i, callback=self.parsePage)

    def parsePage(self, response):
        hxs = HtmlXPathSelector(response)
        urls_list_odd_id = hxs.select('//table[@class="iCIMS_JobsTable"]/tr/td[@class="iCIMS_JobsTableOdd iCIMS_JobsTableField_1"]/a/@href').extract()
        print urls_list_odd_id,">>>>>>>odddddd>>>>>>>>>>>>>>>>"
        urls_list_even_id = hxs.select('//table[@class="iCIMS_JobsTable"]/tr/td[@class="iCIMS_JobsTableEven iCIMS_JobsTableField_1"]/a/@href').extract()
        print urls_list_odd_id,">>>>>>>Evennnn>>>>>>>>>>>>>>>>"
        urls_list = []
        urls_list.extend(urls_list_odd_id)
        urls_list.extend(urls_list_even_id)
        for i in urls_list:
            yield Request(i.encode('utf-8'), callback=self.parseJob)


    def parseJob(self, response):
        pass

After the page opens, I implement pagination over these URLs:

https://careers-preftherapy.icims.com/jobs/search?pr=1
https://careers-preftherapy.icims.com/jobs/search?pr=2

... and so on.

I yield a request for each of these URLs (say there are 6 pages here). When Scrapy reaches the first URL (https://careers-preftherapy.icims.com/jobs/search?pr=1) I try to collect all the href attributes from it, and likewise when it reaches the second URL, and so on.
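
As an aside, computing pageCount by slicing the last two characters ([-2:]) will silently break if the page count ever has a different number of digits. A minimal sketch of a regex-based alternative, assuming the paging cell text ends with the total page count (e.g. "1 of 6"):

import re

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # Full text of the paging cell, e.g. "1 of 6" (assumed format)
    paging_text = hxs.select('//td[@class = "iCIMS_JobsTablePaging"]/table/tr/td[2]/text()').extract()[0]
    # Grab the last run of digits instead of slicing fixed offsets
    match = re.search(r'(\d+)\s*$', paging_text)
    page_count = int(match.group(1)) if match else 1
    for i in range(1, page_count + 1):
        yield Request("https://careers-preftherapy.icims.com/jobs/search?pr=%d" % i, callback=self.parsePage)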

Now, as you can see in my code, there are 20 href tags in total on each page: 10 of them are under td[@class="iCIMS_JobsTableOdd iCIMS_JobsTableField_1"] and the rest are under td[@class="iCIMS_JobsTableEven iCIMS_JobsTableField_1"].
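
Incidentally, the odd/even split can be collapsed into a single query with contains(), so neither class name has to be matched exactly. A minimal sketch, assuming the class names are exactly as shown above:

def parsePage(self, response):
    hxs = HtmlXPathSelector(response)
    # contains() matches both "iCIMS_JobsTableOdd iCIMS_JobsTableField_1"
    # and "iCIMS_JobsTableEven iCIMS_JobsTableField_1" cells in one pass
    urls_list = hxs.select('//table[@class="iCIMS_JobsTable"]/tr/td[contains(@class, "iCIMS_JobsTableField_1")]/a/@href').extract()
    for url in urls_list:
        yield Request(url.encode('utf-8'), callback=self.parseJob)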

The problem is that the tags are sometimes downloaded and sometimes not, and I don't know what is happening. I mean, when we run the spider twice, one run downloads them and the other returns an empty list, like this:

First run:

2012-07-17 17:05:20+0530 [Preferredtherapy] DEBUG: Crawled (200) <GET https://careers-preftherapy.icims.com/jobs/search?pr=2> (referer: https://careers-preftherapy.icims.com/jobs/search)
[] >>>>>>>odddddd>>>>>>>>>>>>>>>>
[] >>>>>>>Evennnn>>>>>>>>>>>>>>>>

Second run:

2012-07-17 17:05:20+0530 [Preferredtherapy] DEBUG: Crawled (200) <GET https://careers-preftherapy.icims.com/jobs/search?pr=2> (referer: https://careers-preftherapy.icims.com/jobs/search)
[u'https://careers-preftherapy.icims.com/jobs/1836/job', u'https://careers-preftherapy.icims.com/jobs/1813/job', u'https://careers-preftherapy.icims.com/jobs/1763/job']>>>>>>>odddddd>>>>>>>>>>>>>>>>
[u'https://careers-preftherapy.icims.com/jobs/1811/job', u'https://careers-preftherapy.icims.com/jobs/1787/job']>>>>>>>Evennnn>>>>>>>>>>>>>>>>

My question is why it sometimes downloads the links and sometimes doesn't. Please reply; it would really help me.

Thanks in advance.


2 Answers


The problem is that the tags are sometimes downloaded and sometimes not, and I don't know what is happening

To understand what is going on, you should debug. My guess is that your XPath queries return an empty list because you received an unexpected page.

Do the following:

def parsePage(self, response):
    hxs = HtmlXPathSelector(response)
    urls_list_odd_id = hxs.select('//table[@class="iCIMS_JobsTable"]/tr/td[@class="iCIMS_JobsTableOdd iCIMS_JobsTableField_1"]/a/@href').extract()
    print urls_list_odd_id,">>>>>>>odddddd>>>>>>>>>>>>>>>>"
    urls_list_even_id = hxs.select('//table[@class="iCIMS_JobsTable"]/tr/td[@class="iCIMS_JobsTableEven iCIMS_JobsTableField_1"]/a/@href').extract()
    print urls_list_odd_id,">>>>>>>Evennnn>>>>>>>>>>>>>>>>"

    if not urls_list_odd_id or not urls_list_even_id:
        from scrapy.shell import inspect_response
        inspect_response(response)

    urls_list = []
    urls_list.extend(urls_list_odd_id)
    urls_list.extend(urls_list_even_id)
    for i in urls_list:
        yield Request(i.encode('utf-8'), callback=self.parseJob)

When you are in the shell, type view(response) to open the downloaded page in a browser (e.g. Firefox); you can then test your XPath queries and find out why they return nothing. The Scrapy documentation has more information about the scrapy shell.
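
For example, from the command line (using the second page URL from the question; in the shell, hxs is a ready-made HtmlXPathSelector for the fetched response):

scrapy shell "https://careers-preftherapy.icims.com/jobs/search?pr=2"
>>> view(response)  # opens the downloaded page in your browser
>>> hxs.select('//table[@class="iCIMS_JobsTable"]').extract()  # test the query piece by piece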

answered 2012-07-17T13:54:02.710

You can use open_in_browser() to open the response in a browser:

def parsePage(self, response):
    from scrapy.utils.response import open_in_browser
    open_in_browser(response)
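
open_in_browser() writes the body Scrapy actually received to a temporary file (adjusting the base tag so relative links still resolve) and opens it, so you see the page exactly as the spider saw it rather than as your browser would fetch it. You can also gate it on the failure case so a browser tab only opens for the problematic responses; a minimal sketch combining it with an empty-list check:

from scrapy.selector import HtmlXPathSelector
from scrapy.utils.response import open_in_browser

def parsePage(self, response):
    hxs = HtmlXPathSelector(response)
    urls_list = hxs.select('//table[@class="iCIMS_JobsTable"]/tr/td/a/@href').extract()
    if not urls_list:
        # Only inspect responses that came back without any job links
        open_in_browser(response)
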
answered 2012-07-24T16:03:32.940