
We have been using the scrapy-splash middleware to pass the scraped HTML source through the JavaScript engine running inside a Splash Docker container.

If we want to use Splash in the spider, we configure a few required project settings (sketched after the request example below) and yield a Request with specific meta parameters:

from scrapy import Request
import scrapyjs  # legacy package name; current releases ship as scrapy_splash

yield Request(url, self.parse_result, meta={
    'splash': {
        'args': {
            # set rendering arguments here
            'html': 1,
            'png': 1,

            # 'url' is prefilled from request url
        },

        # optional parameters
        'endpoint': 'render.json',  # optional; default is render.json
        'splash_url': '<url>',      # overrides SPLASH_URL
        'slot_policy': scrapyjs.SlotPolicy.PER_DOMAIN,
    }
})
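
For reference, the "required project settings" mentioned above are, per the scrapy-splash documentation, roughly the following sketch (the host/port in SPLASH_URL is an assumption for a Splash container listening on localhost:8050):

# settings.py -- scrapy-splash wiring as described in its README
SPLASH_URL = 'http://localhost:8050'  # assumption: Splash on the default port

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'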

This works as documented. But how do we use scrapy-splash in the Scrapy Shell?


3 Answers


Just wrap the URL you want to open in the shell in the Splash HTTP API.

So you would want something like:

scrapy shell 'http://localhost:8050/render.html?url=http://domain.com/page-with-javascript.html&timeout=10&wait=0.5'

where localhost:port is where your Splash service is running,
url is the URL you want to crawl (and don't forget to urlquote it!),
render.html is one of the possible HTTP API endpoints; in this case it returns the rendered HTML page,
timeout is the timeout in seconds,
wait is the time in seconds to wait for JavaScript to execute before reading/saving the HTML.
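
If you build that shell URL programmatically, a small sketch like the one below keeps the target percent-encoded (splash_render_url is a hypothetical helper; localhost:8050 assumes the default Splash port):

from urllib.parse import urlencode

def splash_render_url(target, splash='http://localhost:8050',
                      endpoint='render.html', timeout=10, wait=0.5):
    # urlencode percent-encodes the target URL along with the other arguments
    params = urlencode({'url': target, 'timeout': timeout, 'wait': wait})
    return '{}/{}?{}'.format(splash, endpoint, params)

# prints http://localhost:8050/render.html?url=http%3A%2F%2Fdomain.com%2Fpage-with-javascript.html&timeout=10&wait=0.5
print(splash_render_url('http://domain.com/page-with-javascript.html'))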

Answered 2016-02-12T09:54:20.793

You can run scrapy shell without arguments inside a configured Scrapy project, then create req = scrapy_splash.SplashRequest(url, ...) and call fetch(req).
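
A sketch of such a session, assuming the project already carries the scrapy-splash settings and Splash is reachable at SPLASH_URL (domain.com stands in for the page you actually want to render):

$ scrapy shell
>>> from scrapy_splash import SplashRequest
>>> req = SplashRequest('http://domain.com/page-with-javascript.html',
...                     args={'wait': 0.5})      # same args you would pass in a spider
>>> fetch(req)                                   # routed through the Splash middleware
>>> response.css('title::text').extract_first()  # response now holds the rendered page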

Answered 2016-04-20T13:42:00.727

For Windows users working with Docker Toolbox:

  1. Change the single quotes to double quotes to avoid the invalid hostname:http error.

  2. Change localhost to the Docker IP address shown below the whale logo. For me it was 192.168.99.100.

In the end I got this:

scrapy shell "http://192.168.99.100:8050/render.html?url="https://samplewebsite.com/category/banking-insurance-financial-services/""

Answered 2019-07-07T05:40:24.603