
For three days now I have been trying to save each request's start_url in the meta attribute and pass it along, together with the item, to subsequent requests in Scrapy, so that I can use the start_url as a key into a dict and fill my output with additional data. It really should be simple, since it is explained in the documentation...
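For reference, here is the basic mechanism I am trying to use, as a minimal spider-method excerpt (not my actual code; the URL and callback name are placeholders): meta set on a request shows up on its response, but has to be copied onto any follow-up request by hand.

from scrapy.http import Request

def parse(self, response):
    # 'start_url' was stored in meta by whichever request produced this response
    start_url = response.meta['start_url']
    # meta is not inherited automatically, so forward it explicitly
    yield Request('http://www.example.com/next',
                  callback=self.parse_item,
                  meta={'start_url': start_url})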

There is a discussion about this in the scrapy Google group, and there is a question about it here as well, but I just cannot get it to work :(

I am new to Scrapy and I think it is a great framework, but for my project I have to know the start_url of every request, and that turns out to be complicated.

I would really appreciate some help!

At the moment my code looks like this:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

# testItem is defined in my project's items module (not shown here)

class example(CrawlSpider):

    name = 'example'
    start_urls = ['http://www.example.com']

    rules = (
        Rule(SgmlLinkExtractor(allow=('/blablabla/',)), callback='parse_item'),
    )

    def parse(self, response):
        # Re-yield everything CrawlSpider.parse produces, tagging every
        # request with the start_url carried by the current response.
        for request_or_item in super(example, self).parse(response):
            if isinstance(request_or_item, Request):
                request_or_item = request_or_item.replace(
                    meta={'start_url': response.meta['start_url']})
            yield request_or_item

    def make_requests_from_url(self, url):
        # Seed each start request with its own start_url.
        return Request(url, dont_filter=True, meta={'start_url': url})

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = testItem()
        print response.request.meta, response.url

1 Answer


I would delete this answer, since it does not solve the OP's problem, but I want to leave it here as a simple Scrapy example.


Warning

When writing crawl spider rules, avoid using parse as the callback, since CrawlSpider uses the parse method itself to implement its logic. If you override the parse method, the crawl spider will no longer work.
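To illustrate the warning, a minimal sketch (mine, not part of the original answer) of the safe pattern: leave parse alone and point the rule at a differently named callback.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class GoodSpider(CrawlSpider):
    name = 'good'
    start_urls = ['http://www.example.com']

    rules = (
        # callback must not be named 'parse'; CrawlSpider owns that method
        Rule(SgmlLinkExtractor(allow=('/blablabla/',)), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('visited %s' % response.url)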

Use BaseSpider instead:

import urlparse
from datetime import datetime

from scrapy.spider import BaseSpider

import items     # project module defining Item (assumed, not shown in the answer)
import settings  # project module exposing a DB connection as settings.db (assumed)


class Spider(BaseSpider):

    name = "domain_spider"

    def start_requests(self):
        last_domain_id = 0
        chunk_size = 10
        cursor = settings.db.cursor()

        while True:
            cursor.execute("""
                    SELECT domain_id, domain_url  
                    FROM domains  
                    WHERE domain_id > %s AND scraping_started IS NULL  
                    LIMIT %s
                """, (last_domain_id, chunk_size))
            self.log('Requesting %s domains after %s' % (chunk_size, last_domain_id))
            rows = cursor.fetchall()
            if not rows:
                self.log('No more domains to scrape.')
                break

            for domain_id, domain_url in rows:
                last_domain_id = domain_id
                request = self.make_requests_from_url(domain_url)
                item = items.Item()
                item['start_url'] = domain_url
                item['domain_id'] = domain_id
                item['domain'] = urlparse.urlparse(domain_url).hostname
                request.meta['item'] = item

                cursor.execute("""
                        UPDATE domains  
                        SET scraping_started = %s
                        WHERE domain_id = %s  
                    """, (datetime.now(), domain_id))

                yield request

    ...
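The remaining callback is elided above ("..."); purely as an assumption on my part, here is a sketch of what the default parse callback could look like, reading the seeded item back out of response.meta:

    def parse(self, response):
        # BaseSpider routes responses here by default; the item created in
        # start_requests arrives via response.meta.
        item = response.meta['item']
        item['html'] = response.body  # hypothetical field, for illustration only
        yield item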
answered 2012-08-03T05:00:38.853