I've been stuck on this problem for four days and I've hit a dead end. I want to scrape http://www.ledcor.com/careers/search-careers. On each job-listing page (e.g. http://www.ledcor.com/careers/search-careers?page=2) the spider follows every job link and grabs the job title. That part works so far.
Now I'm trying to get the spider to move on to the next listing page (e.g. from http://www.ledcor.com/careers/search-careers?page=2 to http://www.ledcor.com/careers/search-careers?page=3) and scrape all the jobs there as well. My crawl rules aren't working, and I can't tell what's wrong or missing. Please help.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem


class LedcorSpider(CrawlSpider):
    name = "ledcor"
    allowed_domains = ["www.ledcor.com"]
    start_urls = ["http://www.ledcor.com/careers/search-careers"]

    rules = [
        # follow the pagination links (?page=N) found inside the pager div
        Rule(SgmlLinkExtractor(allow=("http://www.ledcor.com/careers/search-careers\?page=\d",),
                               restrict_xpaths=('//div[@class="pager"]/a',)),
             follow=True),
        # follow each job posting and parse it
        Rule(SgmlLinkExtractor(allow=("http://www.ledcor.com/job\?(.*)",)),
             callback="parse_items"),
    ]

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        item = CraigslistSampleItem()
        item['title'] = hxs.select('//h1/text()').extract()[0].encode('utf-8')
        item['link'] = response.url
        return item
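To see what the two extractors actually match, they can be run by hand in scrapy shell (a quick sketch; it reuses the same allow patterns and pager XPath as above, which are my guesses about the site's markup, not verified facts):

    $ scrapy shell "http://www.ledcor.com/careers/search-careers"
    >>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    >>> # links the pagination rule would follow from page 1
    >>> SgmlLinkExtractor(
    ...     allow=("http://www.ledcor.com/careers/search-careers\?page=\d",),
    ...     restrict_xpaths=('//div[@class="pager"]/a',)).extract_links(response)
    >>> # links the job rule would send to parse_items
    >>> SgmlLinkExtractor(allow=("http://www.ledcor.com/job\?(.*)",)).extract_links(response)

If the first call returns an empty list, the rule never gets a page=2 link to follow.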
Here is Items.py:
from scrapy.item import Item, Field


class CraigslistSampleItem(Item):
    title = Field()
    link = Field()
    desc = Field()
Here is Pipelines.py:
class CraigslistSamplePipeline(object):
    def process_item(self, item, spider):
        return item
Update (with @blender's suggestion applied): it still doesn't crawl to the next pages.
rules = [
    Rule(SgmlLinkExtractor(allow=(r"http://www.ledcor.com/careers/search-careers\?page=\d",),
                           restrict_xpaths=('//div[@class="pager"]/a',)),
         follow=True),
    Rule(SgmlLinkExtractor(allow=("http://www.ledcor.com/job\?(.*)",)),
         callback="parse_items"),
]
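For reference, this is a simplified variant of the rules I would try next. It is only a sketch: it assumes that a shorter pattern is enough (SgmlLinkExtractor applies the allow regexes with re.search against the absolute URL of each link) and that restricting to the pager div rather than its individual <a> children is safe; the pager class name is still my unverified guess about the page markup.

    rules = [
        # pagination: match on the path/query only, since the extractor sees absolute URLs
        Rule(SgmlLinkExtractor(allow=(r"search-careers\?page=\d+",),
                               restrict_xpaths=('//div[@class="pager"]',)),
             follow=True),
        # job detail pages
        Rule(SgmlLinkExtractor(allow=(r"/job\?",)),
             callback="parse_items"),
    ]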