python - How to add a proxy to a BeautifulSoup crawler

Question

These are the imports in my Python crawler:

from __future__ import with_statement

from eventlet.green import urllib2
import eventlet
import re
import urlparse
from bs4 import BeautifulSoup, SoupStrainer
import sqlite3
import datetime

How can I add rotating proxies (one proxy per open thread) to a recursive crawler that works on BeautifulSoup?

If I were using Mechanize's Browser, I know how to add proxies:

from mechanize import Browser

br = Browser()
br.set_proxies({'http': 'http://username:password@proxy:port',
                'https': 'https://username:password@proxy:port'})

But I would like to know specifically what kind of solution BeautifulSoup needs.

Thank you very much for your help!

score 3 · Accepted Answer

Note that a less complicated solution is now available, shared here:

import requests

proxies = {"http": "http://10.10.1.10:3128",
           "https": "http://10.10.1.10:1080"}

requests.get("http://example.org", proxies=proxies)

Then run BeautifulSoup on the response from that request as usual.
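As a minimal sketch of that last step, the response text can be handed straight to BeautifulSoup. The helper names `extract_links` and `fetch_links`, the `html.parser` backend, and the 10-second timeout are illustrative assumptions; the proxy addresses are the placeholders from the snippet above.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder proxy addresses from the answer; replace with real ones.
PROXIES = {"http": "http://10.10.1.10:3128",
           "https": "http://10.10.1.10:1080"}


def extract_links(html):
    """Return the href of every <a> tag that has one (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


def fetch_links(url, proxies=PROXIES, timeout=10):
    """Download url through the proxy, then parse it with BeautifulSoup."""
    response = requests.get(url, proxies=proxies, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    return extract_links(response.text)
```

BeautifulSoup only ever sees the downloaded HTML, so the proxy configuration lives entirely in the `requests.get()` call.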

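So if you want separate threads with different proxies, you simply pass a different dictionary with each request (for example, one picked from a list of dicts).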

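If your existing code already uses requests / bs4, this seems more straightforward to implement, since it is just one extra keyword argument on your existing requests.get() calls. You don't have to initialize/install/open a separate urllib handler for each thread.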
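One way this rotation might look, sketched under some assumptions: the pool entries are made-up placeholder addresses, `pair_with_proxies`, `fetch`, and `crawl` are hypothetical helper names, and the standard library's `ThreadPoolExecutor` stands in for the eventlet green threads used in the question.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle

import requests

# Hypothetical proxy pool; each entry is a proxies dict for requests.get().
PROXY_POOL = [
    {"http": "http://10.10.1.10:3128", "https": "http://10.10.1.10:1080"},
    {"http": "http://10.10.1.11:3128", "https": "http://10.10.1.11:1080"},
]


def pair_with_proxies(urls, pool=PROXY_POOL):
    """Pair each URL with the next proxy dict, wrapping around the pool."""
    return list(zip(urls, cycle(pool)))


def fetch(job):
    """Fetch one (url, proxies) pair and return the page text."""
    url, proxies = job
    response = requests.get(url, proxies=proxies, timeout=10)
    response.raise_for_status()
    return response.text


def crawl(urls, workers=4):
    """Fetch all URLs concurrently, each through its assigned proxy."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(fetch, pair_with_proxies(urls)))
```

Note that this rotates per request rather than pinning one proxy to one thread. The proxies are assigned before any work is submitted, so the worker threads share no rotation state.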

score 2 · Accepted Answer

Take a look at this example of using BeautifulSoup with an HTTP proxy:

http://monzool.net/blog/2007/10/15/html-parsing-with-beautiful-soup/
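The linked post is not reproduced here. As a rough sketch of the standard-library route that the question's urllib2 imports point towards, a proxy handler can be attached to an opener and the downloaded page passed to BeautifulSoup. This uses Python 3's `urllib.request` (the successor to `urllib2`); the helper names `make_opener` and `fetch_soup` and the proxy address are assumptions for illustration.

```python
import urllib.request

from bs4 import BeautifulSoup


def make_opener(proxy_url):
    """Build an opener that routes HTTP and HTTPS traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url,
                                           "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_soup(url, opener, timeout=10):
    """Download url with the given opener and parse it with BeautifulSoup."""
    with opener.open(url, timeout=timeout) as response:
        return BeautifulSoup(response.read(), "html.parser")


# One opener per proxy; a thread would keep its own opener.
opener = make_opener("http://10.10.1.10:3128")  # placeholder address
```

Each opener carries its own proxy settings, so separate threads can hold separate openers without calling `urllib.request.install_opener()`, which would change the process-wide default.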
