I want to crawl an entire website and conditionally extract links.
As suggested in this link, I tried using multiple rules, but it doesn't work: Scrapy does not crawl all the pages.
I tried the following code, but it doesn't scrape any details.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from businesslist.items import BusinesslistItem

class BusinesslistSpider(CrawlSpider):
    name = 'businesslist'
    allowed_domains = ['www.businesslist.ae']
    start_urls = ['http://www.businesslist.ae/']

    rules = (
        # Follow all links on the site.
        Rule(SgmlLinkExtractor()),
        # Parse company detail pages.
        Rule(SgmlLinkExtractor(allow=r'company/(\d)+/'), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        hxs = HtmlXPathSelector(response)
        i = BusinesslistItem()
        company = hxs.select('//div[@class="text companyname"]/strong/text()').extract()[0]
        address = hxs.select('//div[@class="text location"]/text()').extract()[0]
        location = hxs.select('//div[@class="text location"]/a/text()').extract()[0]
        i['url'] = response.url
        i['company'] = company
        i['address'] = address
        i['location'] = location
        return i
In my case, the second rule is never applied, so the detail pages are never parsed.
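My understanding is that CrawlSpider processes rules in order and each extracted link is claimed by the first rule whose extractor matches it, so the catch-all `SgmlLinkExtractor()` rule claims the company links before the second rule can attach `parse_item`. A minimal sketch of that matching behavior (plain `re`, no Scrapy; the URL and rule tuples are illustrative, not from the real site):

```python
import re

# Simplified model of CrawlSpider rule matching: for each extracted
# link, the FIRST rule whose pattern matches decides the callback.
def pick_callback(url, rules):
    for pattern, callback in rules:
        if re.search(pattern, url):
            return callback
    return None

# Hypothetical company detail URL for illustration.
company_url = 'http://www.businesslist.ae/company/39911/'

# Original order: the catch-all rule matches first, so the
# 'parse_item' callback is never chosen for company pages.
original_rules = [(r'.*', None), (r'company/\d+/', 'parse_item')]

# Fixed order: the specific rule comes first, the catch-all
# (follow-only) rule second.
fixed_rules = [(r'company/\d+/', 'parse_item'), (r'.*', None)]

print(pick_callback(company_url, original_rules))  # None
print(pick_callback(company_url, fixed_rules))     # parse_item
```

If this is indeed the cause, putting the `allow=r'company/(\d)+/'` rule before the catch-all rule in the `rules` tuple should let the detail pages reach `parse_item` while the broad rule still follows everything else.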