scrapy - Crawling a website sequentially with Scrapy

Question

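Is there a way to tell Scrapy to stop crawling based on a condition found in a second-level page? This is what I am doing:

I start with a start_url (the first-level page)
I use parse(self, response) to extract a set of URLs from the start_url
I then queue those links as Requests with parseDetailPage(self, response) as the callback
In parseDetailPage (the second-level page) I find out whether I can stop crawling

Right now I am using CloseSpider() to do this, but the problem is that by the time I start crawling the second-level pages, the URLs to be parsed are already queued, and I don't know how to remove them from the queue. Is there a way to crawl the list of links sequentially and then be able to stop in parseDetailPage?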

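from scrapy.exceptions import CloseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

# WPUtil and WoodPeckerItem come from the asker's own project (not shown)
start_urls = ["http://sfbay.craigslist.org/sof/"]

def __init__(self):
    # Set to False by parseDetailPage once the stop condition is met
    self.job_in_range = True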
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    results = hxs.select('//blockquote[@id="toc_rows"]')
    items = []
    if results:
        links = results.select('.//p[@class="row"]/a/@href')
        for link in links:
            nextUrl = link.extract()
            # Compare the extracted URL string by value, not the selector by identity
            if nextUrl == self.end_url:
                break
            if WPUtil.validateUrl(nextUrl):
                item = WoodPeckerItem()
                item['url'] = nextUrl
                # Pass the partially filled item on to the detail-page callback
                request = Request(nextUrl, meta={'item': item}, callback=self.parseDetailPage)
                items.append(request)
    else:
        self.log('Could not parse the document')
    return items

def parseDetailPage(self, response):
    if self.job_in_range is False:
        raise CloseSpider('End date reached - No more crawling for ' + self.name)
    hxs = HtmlXPathSelector(response)
    self.log('Parsing %s' % response.url)
    body = hxs.select('//article[@id="pagecontainer"]/section[@class="body"]')
    item = response.meta['item']
    item['postDate'] = body.select('.//section[@class="userbody"]/div[@class="postinginfos"]/p')[1].select('.//date/text()')[0].extract()
    # Read the title before testing it, and compare strings with ==, not `is`
    item['jobTitle'] = body.select('.//h2[@class="postingtitle"]/text()')[0].extract()
    if item['jobTitle'] == 'Admin':
        self.job_in_range = False
        raise CloseSpider('Stop crawling')
    item['description'] = body.select('.//section[@class="userbody"]/section[@id="postingbody"]').extract()
    return item

score 0 · Accepted Answer

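Do you mean that you want to stop the spider and later resume it without re-parsing the URLs that have already been parsed? If so, you can try the JOBDIR setting. It persists the pending request queue to a directory you specify on disk.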
