
I want to crawl an entire website and conditionally extract links.

As suggested in this link, I tried multiple rules, but it didn't work: Scrapy does not crawl all the pages.

I tried the following code, but it doesn't scrape any details.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from businesslist.items import BusinesslistItem  # adjust to your project's items module


class BusinesslistSpider(CrawlSpider):
    name = 'businesslist'
    allowed_domains = ['www.businesslist.ae']
    start_urls = ['http://www.businesslist.ae/']

    rules = (
        # Follow every link on the site.
        Rule(SgmlLinkExtractor()),
        # Extract details from company pages.
        Rule(SgmlLinkExtractor(allow=r'company/(\d)+/'), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        hxs = HtmlXPathSelector(response)
        i = BusinesslistItem()
        company = hxs.select('//div[@class="text companyname"]/strong/text()').extract()[0]
        address = hxs.select('//div[@class="text location"]/text()').extract()[0]
        location = hxs.select('//div[@class="text location"]/a/text()').extract()[0]
        i['url'] = response.url
        i['company'] = company
        i['address'] = address
        i['location'] = location
        return i

In my case the second rule is never applied, so the detail pages are never parsed.


1 Answer


The first rule, Rule(SgmlLinkExtractor()), matches every link. CrawlSpider applies only the first rule that matches a given link, so the second rule (and its parse_item callback) is never used.

Try the following:

...
start_urls = ['http://www.businesslist.ae/sitemap.html']
...
# Rule(SgmlLinkExtractor()),
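
Alternatively, you can keep the broad crawl and just reorder the rules: since only the first matching rule is applied to each link, putting the specific rule first lets parse_item run, while the catch-all still follows everything else. A minimal sketch, assuming the same Scrapy 0.x SgmlLinkExtractor API as in the question:

rules = (
    # Specific rule first, so company detail pages reach parse_item.
    # A rule with a callback does not follow links by default,
    # hence the explicit follow=True to keep the crawl going.
    Rule(SgmlLinkExtractor(allow=r'company/(\d)+/'), callback='parse_item', follow=True),
    # Catch-all rule last: it only handles links no earlier rule matched.
    Rule(SgmlLinkExtractor()),
)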
answered 2013-05-22T08:44:40.990