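I use scrapy-splash to build my spider. I now need to maintain a session, so I enabled scrapy.downloadermiddlewares.cookies.CookiesMiddleware, which handles the Set-Cookie header. I know it handles the Set-Cookie header because I set COOKIES_DEBUG=True, which makes CookiesMiddleware print information about the Set-Cookie headers it sees.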

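The problem: when I also add Splash to the picture, the Set-Cookie printouts disappear. The response headers I actually get are {'Date': ['Sun, 25 Sep 2016 12:09:55 GMT'], 'Content-Type': ['text/html; charset=utf-8'], 'Server': ['TwistedWeb/16.1.1']}, which come from the Splash rendering service itself (it runs on TwistedWeb) rather than from the target site.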

Is there any directive that tells Splash to also give me the original response headers?

1 Answer

To get the original response headers, you can write a Splash Lua script; see the example in the scrapy-splash README:

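Use a Lua script to get an HTML response with cookies, headers, body and method set to the correct values; the lua_source argument value is cached on the Splash server and is not sent with each request (this requires Splash 2.1+):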

import scrapy
from scrapy_splash import SplashRequest

script = """
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go{
    splash.args.url,
    headers=splash.args.headers,
    http_method=splash.args.http_method,
    body=splash.args.body,
    })
  assert(splash:wait(0.5))

  local entries = splash:history()
  local last_response = entries[#entries].response
  return {
    url = splash:url(),
    headers = last_response.headers,
    http_status = last_response.status,
    cookies = splash:get_cookies(),
    html = splash:html(),
  }
end
"""

class MySpider(scrapy.Spider):


    # ...
        yield SplashRequest(url, self.parse_result,
            endpoint='execute',
            cache_args=['lua_source'],
            args={'lua_source': script},
            headers={'X-My-Header': 'value'},
        )

    def parse_result(self, response):
        # here response.body contains result HTML;
        # response.headers are filled with headers from last
        # web page loaded to Splash;
        # cookies from all responses and from JavaScript are collected
        # and put into Set-Cookie response header, so that Scrapy
        # can remember them.
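
The Lua script above returns `last_response.headers` taken from `splash:history()`, which reports data in HAR format. Assuming those headers arrive as a list of `{name, value}` entries (the HAR convention), the following minimal sketch shows how such a list could be grouped by header name so that repeated `Set-Cookie` values are kept. The helper name `har_headers_to_dict` and the sample data are hypothetical and are not part of scrapy-splash, which performs its own conversion internally.

```python
from collections import defaultdict


def har_headers_to_dict(har_headers):
    """Group HAR-style header entries by name (hypothetical helper).

    Each entry is assumed to look like {"name": ..., "value": ...}.
    Names are normalized with str.title() so "set-cookie" and
    "Set-Cookie" end up under the same key.
    """
    grouped = defaultdict(list)
    for entry in har_headers:
        grouped[entry["name"].title()].append(entry["value"])
    return dict(grouped)


# Made-up sample resembling what the target site might send.
sample = [
    {"name": "Content-Type", "value": "text/html; charset=utf-8"},
    {"name": "set-cookie", "value": "sessionid=abc123; Path=/"},
    {"name": "Set-Cookie", "value": "csrftoken=xyz; Path=/"},
]

headers = har_headers_to_dict(sample)
print(headers["Set-Cookie"])
```

Keeping a list per header name matters here because a site may send several `Set-Cookie` headers in one response, and collapsing them into a single value would lose cookies.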

scrapy-splash also provides built-in helpers for cookie handling; they are enabled in this example as long as scrapy-splash is configured as described in the README.
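
For reference, the README configuration mentioned above looks roughly like the following `settings.py` fragment. This is a sketch from memory of the scrapy-splash README, so check it against the version you have installed; the `SPLASH_URL` value is an assumption (a Splash instance running locally on port 8050).

```python
# settings.py (sketch; verify against the scrapy-splash README)

# Assumed address of the Splash instance; adjust to your deployment.
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    # Passes cookies between Scrapy and Splash scripts.
    "scrapy_splash.SplashCookiesMiddleware": 723,
    # Rewrites requests so they go through Splash.
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    # Needed for cache_args (e.g. caching lua_source on the server).
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Dupefilter that takes Splash arguments into account.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

With `SplashCookiesMiddleware` enabled, the `cookies` field returned by the Lua script is merged back into Scrapy's cookie jar, which is what keeps the session alive across requests.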

Answered 2016-09-25T20:10:33.337