out-of-memory - Taking a screenshot of a URL with Splash

Question

I am trying to use the render.png endpoint of Scrapy Splash to take a screenshot of the following URL:

https://www.laithwaites.co.uk/product/Kilikanoon-Baroota-Shiraz-2014/66877

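In practice I make the request with python-requests. However, even when I open the request in a browser to test it, CPU usage climbs above 100% (as measured by top), Splash hangs for a long time, and it eventually crashes. My guess is that it runs out of memory.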

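I tried increasing maxrss from 500 to 1500, but that did not help. I also tried adjusting the wait/timeout parameters of the render.png endpoint, but the result did not change.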
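
For reference, a request of the kind described above might look like the following minimal sketch. The host and port (localhost:8050), the helper names render_png_url and fetch_screenshot, the output file name, and the specific wait/timeout values are assumptions chosen for illustration, not the values actually used.

```python
from urllib.parse import urlencode

# Assumption: a Splash instance is listening locally on the default port.
SPLASH_BASE = "http://localhost:8050"


def render_png_url(page_url, wait=2.0, timeout=60.0):
    """Build a render.png request URL with explicit wait/timeout values."""
    query = urlencode({"url": page_url, "wait": wait, "timeout": timeout})
    return f"{SPLASH_BASE}/render.png?{query}"


def fetch_screenshot(page_url, out_path="screenshot.png"):
    """Request the screenshot from Splash and write the PNG bytes to disk."""
    # Imported here so render_png_url can be used without requests installed.
    import requests

    # The client-side timeout is set above Splash's own timeout so that
    # Splash has a chance to report its error first.
    resp = requests.get(render_png_url(page_url), timeout=90)
    resp.raise_for_status()
    with open(out_path, "wb") as fh:
        fh.write(resp.content)
    return out_path
```

Calling fetch_screenshot with the URL above reproduces the hang on my side regardless of the wait/timeout values.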

How can I take a screenshot of this page with Splash?

score 0 · Accepted Answer

This looks like a problem in the JS engine. If you turn JS off, you can at least get a screenshot:

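import requests

# Lua script for Splash's /execute endpoint: load the page with
# JavaScript disabled and return a PNG screenshot.
script = """
function main(splash)
  local url = splash.args.url
  splash.js_enabled = false
  assert(splash:go(url))
  return splash:png()
end
"""

resp = requests.post('http://localhost:8050/execute', json={
    'lua_source': script,
    'url': '<url>',  # placeholder: the page to capture
})
resp.raise_for_status()

# The response body is the PNG image itself.
with open('screenshot.png', 'wb') as f:
    f.write(resp.content)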

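If you enable verbose logging (docker run -it -p8050:8050 scrapinghub/splash -v3), you can see that Splash hangs after downloading a fetch.js file from the cloud. It probably contains some code that makes Splash hang. Instead of disabling JS, you can try filtering out just this file (aborting its download):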

function main(splash)
  local url = splash.args.url
  -- Inspect every outgoing request and drop the one for fetch.js.
  splash:on_request(function (req)
    if req.url:find('fetch.js') ~= nil then
      req:abort()
    end
  end)
  assert(splash:go(url))
  return splash:png()
end

The script above works for me.
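
To run the filtering script from Python, something like the sketch below should do. The endpoint address, the helper names (build_payload, is_png, screenshot), and the extra blocked argument are assumptions added for illustration. The script calls the method with the colon form, req:abort(), as Splash's scripting documentation does, and passes plain=true to find so that the dot in the file name is matched literally rather than as a Lua pattern wildcard.

```python
# First eight bytes of every PNG file; used to sanity-check the response.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Same idea as the Lua script above, with the blocked file name passed in
# from Python through splash.args (an assumption made for this sketch).
FILTER_SCRIPT = """
function main(splash)
  local url = splash.args.url
  local blocked = splash.args.blocked
  splash:on_request(function (req)
    -- plain-text search: no Lua pattern characters are interpreted
    if req.url:find(blocked, 1, true) ~= nil then
      req:abort()
    end
  end)
  assert(splash:go(url))
  return splash:png()
end
"""


def build_payload(page_url, blocked="fetch.js"):
    """JSON body for /execute; extra keys show up in splash.args."""
    return {"lua_source": FILTER_SCRIPT, "url": page_url, "blocked": blocked}


def is_png(data):
    """Return True if the bytes start with the PNG file signature."""
    return data.startswith(PNG_SIGNATURE)


def screenshot(page_url, out_path="screenshot.png",
               endpoint="http://localhost:8050/execute"):
    """Post the script to Splash and save the returned image."""
    # Imported here so the helpers above work without requests installed.
    import requests

    resp = requests.post(endpoint, json=build_payload(page_url), timeout=120)
    resp.raise_for_status()
    if not is_png(resp.content):
        raise ValueError("Splash did not return a PNG image")
    with open(out_path, "wb") as fh:
        fh.write(resp.content)
    return out_path
```

Checking the PNG signature before writing the file catches the case where Splash answers with a JSON error body instead of an image.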
