proxy - How to use multiple proxies when crawling with scrapy + splash?

Translated from: https://stackoverflow.com/questions/38365751 2016-07-14T04:54:25.077


We are crawling with scrapy + splash and want to use multiple proxies. However, a Splash proxy profile supports only a single proxy: https://splash.readthedocs.io/en/stable/api.html#proxy-profiles

[proxy]

; required
host=proxy.crawlera.com
port=8010

; optional, default is no auth
username=username
password=password

; optional, default is HTTP. Allowed values are HTTP and SOCKS5
type=HTTP

How can we use multiple proxies when crawling with scrapy + splash?

1 Answer

There are a few options:

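Use multiple proxy profiles (as Rafael Almeida suggested in the comments);
Pass a different proxy URL with each request (see http://splash.readthedocs.io/en/stable/api.html#arg-proxy);
Write a Splash Lua script and call request:set_proxy in a splash:on_request callback; the docs have an example. This way you can set different proxies for the different requests a page initiates, not just a single proxy per rendered page. I am not aware of a way to do this in other browser automation tools such as phantomjs or selenium.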
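
The second and third options could be sketched roughly as follows. This is only an illustration under a few assumptions: the proxy URLs are placeholders, the helper names (per_page_proxy_args, per_request_proxy_args) and the `proxies` script argument are made up for this example and are not part of Splash, and the script assumes that a JSON list passed as an argument arrives in Lua as an array-like table.

```python
import itertools
from urllib.parse import urlsplit

# Hypothetical proxy URLs; substitute your own.
PROXIES = [
    "http://user1:secret1@proxy1.example.com:8010",
    "http://user2:secret2@proxy2.example.com:8011",
]
_rotation = itertools.cycle(PROXIES)


def per_page_proxy_args():
    """Option 2: one proxy per rendered page, via Splash's `proxy` argument."""
    return {"proxy": next(_rotation)}


# Option 3: rotate proxies per outgoing request within a single page render.
# `splash.args.proxies` is a name chosen for this sketch, not a Splash built-in.
ROTATING_PROXY_SCRIPT = """
function main(splash)
    local proxies = splash.args.proxies
    local i = 0
    splash:on_request(function(request)
        -- Round-robin over the proxies (Lua tables are 1-indexed).
        i = i % #proxies + 1
        local p = proxies[i]
        request:set_proxy{
            host = p.host,
            port = p.port,
            username = p.username,
            password = p.password,
        }
    end)
    assert(splash:go(splash.args.url))
    return splash:html()
end
"""


def per_request_proxy_args():
    """Build arguments for Splash's `execute` endpoint (script plus proxy list)."""
    proxies = []
    for proxy_url in PROXIES:
        parts = urlsplit(proxy_url)
        proxies.append({
            "host": parts.hostname,
            "port": parts.port,
            "username": parts.username,
            "password": parts.password,
        })
    return {"lua_source": ROTATING_PROXY_SCRIPT, "proxies": proxies}
```

With scrapy-splash, these dictionaries would presumably be passed as the args of a SplashRequest, for example SplashRequest(url, callback, args=per_page_proxy_args()) for the second option, or SplashRequest(url, callback, endpoint="execute", args=per_request_proxy_args()) for the third.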

Answered 2016-09-25T20:20:48.807