
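I am trying to use Scrapy with Splash (scrapyjs) to scrape a page that relies on JavaScript, so that I get the fully loaded page. I render it through Splash from Scrapy with the code below, using exactly the same arguments as when I query the Splash server at localhost:8050 directly: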

    script = """
    function main(splash)
      local url = splash.args.url
      assert(splash:go(url))
      assert(splash:wait(0.5))
      return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
      }
    end
    """

    splash_args = {
        'wait': 0.5,
        'url': response.url,
        'images': 1,
        'expand': 1,
        'timeout': 60.0,
        'lua_source': script
    }

    yield SplashRequest(response.url,
                        self.parse_list_other_page,
                        cookies=response.request.cookies,
                        args=splash_args)

The response HTML does not contain the elements I need, but if I render the page directly through the server running at localhost:8050, it works fine.

Do you know what the problem is?

This is my settings.py:

    SPLASH_URL = 'http://127.0.0.1:8050'
    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }

    # Enable or disable downloader middlewares
    # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        # 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }

    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36"


1 Answer


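The default endpoint does not run Lua scripts (SplashRequest defaults to 'render.html'; a raw scrapy.Request with Splash meta defaults to 'render.json'). To use the 'lua_source' argument, i.e. to run a Lua script, you have to use the 'execute' endpoint. Note that keyword arguments such as endpoint must come after the positional callback: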

    yield SplashRequest(response.url,
                        self.parse_list_other_page,
                        endpoint='execute',
                        cookies=response.request.cookies,
                        args=splash_args)
Answered 2017-05-18T16:08:37.417
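
To illustrate what the 'execute' endpoint involves, here is a minimal standard-library sketch of the request Splash receives and of how the table returned by a script like the one in the question could be unpacked. It does not use scrapy_splash. The helper names `build_execute_request` and `unpack_execute_result`, the example URL, and the sample result are hypothetical, and the sketch assumes Splash's documented behaviour of returning a Lua table as JSON with binary values (such as the PNG) base64-encoded.

```python
import base64
import json
from urllib.parse import urljoin

# A trimmed version of the Lua script from the question.
LUA_SCRIPT = """
function main(splash)
  assert(splash:go(splash.args.url))
  assert(splash:wait(0.5))
  return {html = splash:html(), png = splash:png()}
end
"""


def build_execute_request(splash_url, target_url, lua_source, **extra_args):
    """Return (endpoint_url, json_body) for a POST to Splash's execute endpoint.

    Hypothetical helper: every key in the JSON body becomes available to the
    Lua script as splash.args.<key>.
    """
    payload = {"url": target_url, "lua_source": lua_source}
    payload.update(extra_args)
    endpoint_url = urljoin(splash_url.rstrip("/") + "/", "execute")
    return endpoint_url, json.dumps(payload)


def unpack_execute_result(raw_json):
    """Split the JSON returned for the script's table into HTML and PNG bytes."""
    data = json.loads(raw_json)
    html = data.get("html", "")
    # Assumption: binary values in the returned table arrive base64-encoded.
    png_bytes = base64.b64decode(data["png"]) if "png" in data else b""
    return html, png_bytes


endpoint_url, body = build_execute_request(
    "http://127.0.0.1:8050", "http://example.com", LUA_SCRIPT, timeout=60.0
)

# Hypothetical response body, standing in for what Splash would send back.
sample_result = json.dumps(
    {"html": "<html></html>", "png": base64.b64encode(b"\x89PNG").decode("ascii")}
)
html, png_bytes = unpack_execute_result(sample_result)
```

Inside a scrapy_splash callback the decoded JSON should be available as `response.data`, so the rendered markup from the script's table would be read as `response.data['html']` rather than from a plain HTML body.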