I want to crawl an entire website and conditionally extract links.
As suggested in this link, I tried using multiple rules, but it doesn't work: Scrapy does not crawl all the pages.
I tried the following code, but it doesn't scrape any details.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from businesslist.items import BusinesslistItem

class BusinesslistSpider(CrawlSpider):
    name = 'businesslist'
    allowed_domains = ['www.businesslist.ae']
    start_urls = ['http://www.businesslist.ae/']

    rules = (
        # Follow all links on the site.
        Rule(SgmlLinkExtractor()),
        # Parse company detail pages.
        Rule(SgmlLinkExtractor(allow=r'company/(\d)+/'), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        hxs = HtmlXPathSelector(response)
        i = BusinesslistItem()
        company = hxs.select('//div[@class="text companyname"]/strong/text()').extract()[0]
        address = hxs.select('//div[@class="text location"]/text()').extract()[0]
        location = hxs.select('//div[@class="text location"]/a/text()').extract()[0]
        i['url'] = response.url
        i['company'] = company
        i['address'] = address
        i['location'] = location
        return i
In my case, the second rule is never applied, so the detail pages are never parsed.
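My understanding is that CrawlSpider processes rules in order and each extracted link is claimed by the first rule whose extractor matches it, so the catch-all `SgmlLinkExtractor()` rule claims the company links before the second rule can attach `parse_item`. A minimal sketch of that matching behavior (plain `re`, no Scrapy; the URL and rule tuples are illustrative, not from the real site):

```python
import re

# Simplified model of CrawlSpider rule matching: for each extracted
# link, the FIRST rule whose pattern matches decides the callback.
def pick_callback(url, rules):
    for pattern, callback in rules:
        if re.search(pattern, url):
            return callback
    return None

# Hypothetical company detail URL for illustration.
company_url = 'http://www.businesslist.ae/company/39911/'

# Original order: the catch-all rule matches first, so the
# 'parse_item' callback is never chosen for company pages.
original_rules = [(r'.*', None), (r'company/\d+/', 'parse_item')]

# Fixed order: the specific rule comes first, the catch-all
# (follow-only) rule second.
fixed_rules = [(r'company/\d+/', 'parse_item'), (r'.*', None)]

print(pick_callback(company_url, original_rules))  # None
print(pick_callback(company_url, fixed_rules))     # parse_item
```

If this is indeed the cause, putting the `allow=r'company/(\d)+/'` rule before the catch-all rule in the `rules` tuple should let the detail pages reach `parse_item` while the broad rule still follows everything else.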