I created a spider that extends CrawlSpider, following the advice at http://scrapy.readthedocs.org/en/latest/topics/spiders.html.
The problem is that I need to parse both the start URL (which happens to coincide with the hostname) and some of the links it contains.
So I defined a rule like this:

rules = [Rule(SgmlLinkExtractor(allow=['/page/\d+']), callback='parse_items', follow=True)]

but nothing happened.
Then I tried defining a set of rules like this:

rules = [Rule(SgmlLinkExtractor(allow=['/page/\d+']), callback='parse_items', follow=True),
         Rule(SgmlLinkExtractor(allow=['/']), callback='parse_items', follow=True)]

Now the problem is that the spider parses everything.
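As a sanity check on what the two allow patterns are supposed to match, here is a quick illustration with made-up URLs (SgmlLinkExtractor treats allow as regular expressions matched against the absolute URLs of extracted links):

import re

# '/page/\d+' only matches paginated listing URLs, not the homepage itself
print bool(re.search(r'/page/\d+', 'http://techcrunch.com/page/2/'))  # True
print bool(re.search(r'/page/\d+', 'http://techcrunch.com/'))         # False

# '/' occurs in every URL on the site, which is why the second rule makes the spider parse everything
print bool(re.search(r'/', 'http://techcrunch.com/2012/09/some-article/'))  # True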
How can I tell the spider to parse the start_url as well as only some of the links it contains?
Update:
I tried overriding the parse_start_url method, so now I am able to get data from the start page, but it still does not follow the links defined with a Rule:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

# Article is the project's Item class; its import (from the project's items module) is omitted here

class ExampleSpider(CrawlSpider):
    name = 'TechCrunchCrawler'
    start_urls = ['http://techcrunch.com']
    allowed_domains = ['techcrunch.com']

    # follow pagination links like /page/2 and parse them with parse_links
    rules = [Rule(SgmlLinkExtractor(allow=['/page/\d+']), callback='parse_links', follow=True)]

    def parse_start_url(self, response):
        # CrawlSpider calls this for the responses of the start_urls
        print '++++++++++++++++++++++++parse start url++++++++++++++++++++++++'
        return self.parse_links(response)

    def parse_links(self, response):
        print '++++++++++++++++++++++++parse link called++++++++++++++++++++++++'
        articles = []
        for i in HtmlXPathSelector(response).select('//h2[@class="headline"]/a'):
            article = Article()
            article['title'] = i.select('./@title').extract()
            article['link'] = i.select('./@href').extract()
            articles.append(article)
        return articles
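For completeness, Article is just my Item class; it only declares the two fields populated in parse_links (a sketch of the definition, which lives in the project's items module):

from scrapy.item import Item, Field

class Article(Item):
    title = Field()
    link = Field()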