
The site I'm scraping uses JavaScript that sets a cookie and checks it on the backend to make sure JS is enabled. Extracting the cookie from the HTML code is simple enough, but setting it in Scrapy seems to be the problem. So my code is:

import re

from scrapy.contrib.spiders import Rule
from scrapy.contrib.spiders.init import InitSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request

class TestSpider(InitSpider):
    ...
    rules = (Rule(SgmlLinkExtractor(allow=('products/./index\.html', )), callback='parse_page'),)

    def init_request(self):
        return Request(url = self.init_url, callback=self.parse_js)

    def parse_js(self, response):
        match = re.search(r'setCookie\(\'(.+?)\',\s*?\'(.+?)\',', response.body, re.M)
        if match:
            cookie = match.group(1)
            value = match.group(2)
        else:
            raise BaseException("Did not find the cookie", response.body)
        return Request(url=self.test_page, callback=self.check_test_page, cookies={cookie:value})

    def check_test_page(self, response):
        if 'Welcome' in response.body:
            self.initialized()

    def parse_page(self, response):
        pass  # scraping...

I can see that the content is available in check_test_page, and the cookie works fine. But it never even gets to parse_page, because the CrawlSpider doesn't see any links without the correct cookie. Is there a way to set a cookie for the duration of the scraping session? Or do I have to use BaseSpider and add the cookie to every request manually?
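As an aside, the extraction step itself can be exercised outside Scrapy. This is only a sketch: the sample body and the js_check/abc123 cookie name and value are made-up placeholders for whatever setCookie() call the real page embeds.

```python
import re

# Hypothetical sample of the inline JS the site serves when it checks
# for JavaScript support; the real cookie name/value will differ.
body = "<html><script>setCookie('js_check', 'abc123', 30);</script></html>"

match = re.search(r"setCookie\('(.+?)',\s*?'(.+?)',", body, re.M)
if match:
    cookie, value = match.group(1), match.group(2)
    print(cookie, value)  # prints: js_check abc123
```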

A less desirable option would be to somehow set the cookie through the Scrapy configuration files (the value never seems to change). Is that possible?
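If the value really never changes, one way to approximate a "configured" cookie is a small downloader middleware that injects it into every outgoing request. This is only a sketch, not from the thread; StickyCookieMiddleware, js_check, and abc123 are hypothetical names, and the class would still need to be registered under DOWNLOADER_MIDDLEWARES in settings.py.

```python
class StickyCookieMiddleware(object):
    """Downloader middleware sketch: attach a fixed cookie to every request.

    COOKIE_NAME and COOKIE_VALUE are hypothetical placeholders for the
    name/value pair extracted once from the site's setCookie() call.
    """
    COOKIE_NAME = 'js_check'
    COOKIE_VALUE = 'abc123'

    def process_request(self, request, spider):
        # Scrapy Request objects carry a `cookies` dict that the built-in
        # cookie handling merges into the Cookie header.
        if isinstance(request.cookies, dict):
            request.cookies.setdefault(self.COOKIE_NAME, self.COOKIE_VALUE)
        return None  # None means: continue processing the request normally
```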


2 Answers


I haven't used InitSpider before, but looking at the code of scrapy.contrib.spiders.init.InitSpider, I see:

def initialized(self, response=None):
    """This method must be set as the callback of your last initialization
    request. See self.init_request() docstring for more info.
    """
    self._init_complete = True
    reqs = self._postinit_reqs[:]
    del self._postinit_reqs
    return reqs

def init_request(self):
    """This function should return one initialization request, with the
    self.initialized method as callback. When the self.initialized method
    is called this spider is considered initialized. If you need to perform
    several requests for initializing your spider, you can do so by using
    different callbacks. The only requirement is that the final callback
    (of the last initialization request) must be self.initialized. 

    The default implementation calls self.initialized immediately, and
    means that no initialization is needed. This method should be
    overridden only when you need to perform requests to initialize your
    spider
    """
    return self.initialized()

You wrote:

I can see that the content is available in check_test_page, and the cookie works fine. But it never even gets to parse_page, because the CrawlSpider doesn't see any links without the correct cookie.

I think parse_page is never called because you never finish initialization: the final callback of your initialization chain has to return self.initialized(), which releases the queued post-init requests.

I think this should work:

def check_test_page(self, response):
    if 'Welcome' in response.body:
        return self.initialized()
Answered 2012-08-14T16:18:16.943

It turns out InitSpider is a BaseSpider. So it looks like 1) there is no way to use a CrawlSpider in this situation, and 2) there is no way to set a sticky cookie.

Answered 2012-08-24T11:18:02.623