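Let's say I have a website I want to scrape, e.g. cheapoair.com.
I want to use plain requests in Python to scrape the data on the first, hypothetical page. If I end up being blocked by the server, I want to switch to a proxy. I have a list of proxy servers and a method for picking one, and I also have a list of user agent strings. However, I think I need help thinking the problem through.
For reference, uagen() returns a user agent string and proxit() returns a proxy.
This is what I have so far: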

    import sys
    import time
    from http import cookiejar
    from socket import error as SocketError

    import requests

    from proxy_def import *  # expected to provide uagen() and proxit()

    start_time = time.time()


    class BlockAll(cookiejar.CookiePolicy):
        """Cookie policy that rejects every cookie the server tries to set."""
        return_ok = set_ok = domain_return_ok = path_return_ok = lambda self, *args, **kwargs: False
        netscape = True
        rfc2965 = hide_cookie2 = False


    headers = {'User-Agent': uagen()}
    print(headers)
    s = requests.Session()
    s.cookies.set_policy(BlockAll())  # pass an instance of the policy, not the class
    cookies = {'SetCurrency': 'USD'}
    sp = proxit()

    for i in range(100000000000):
        while True:
            try:
                print('trying on ', sp)
                print('with user agent headers', headers)
                s.proxies = {"http": sp}
                r = s.get("http://www.cheapoair.com", headers=headers, timeout=15, cookies=cookies)
                print(i, sp, 'success')
                print("--- %s seconds ---" % (time.time() - start_time))
            except (SocketError, requests.ConnectionError, requests.Timeout):
                # Any connection failure: pick a new proxy and user agent, then retry.
                print('passing ', sp)
                sp = proxit()
                headers = {'User-Agent': uagen()}
                print('this is the new proxy ', sp)
                print('this is the new headers ', headers)
                continue
            except KeyboardInterrupt:
                print("The program has been terminated")
                sys.exit(1)
            break
        #print(r.text)
    print('all done', '\n')

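What I am looking for is an idea of how to express this: start with a normal request (not through a proxy), and if that ends in an error (for example, being rejected by the server), switch to a proxy and try again.
I can almost picture it, but I can't quite see it.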
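
One way to picture the direct-first, proxy-on-failure flow is the rough sketch below. It is only an illustration under stated assumptions: the function name fetch_with_fallback, its parameters, the max_attempts limit, and the choice to treat HTTP 403 and 429 as "blocked" are all hypothetical and not part of the code above. The session argument is assumed to behave like requests.Session, and next_proxy and next_user_agent are assumed to behave like proxit and uagen.

```python
def fetch_with_fallback(session, url, next_proxy, next_user_agent, max_attempts=5):
    """Try a direct request first; after any failure, retry through a fresh proxy.

    Returns the response on success, or None once max_attempts is used up.
    """
    proxy = None  # None means the first attempt goes out directly
    headers = {"User-Agent": next_user_agent()}
    for _ in range(max_attempts):
        # An empty mapping means no proxy is explicitly configured on the session.
        session.proxies = {"http": proxy, "https": proxy} if proxy else {}
        try:
            response = session.get(url, headers=headers, timeout=15)
            # Assumption: a block may show up as a status code, not an exception.
            if response.status_code not in (403, 429):
                return response
        except OSError:
            # requests.RequestException and socket.error both derive from OSError,
            # so this one clause covers connection errors and timeouts.
            pass
        # The attempt failed: rotate to a new proxy and user agent, then loop again.
        proxy = next_proxy()
        headers = {"User-Agent": next_user_agent()}
    return None
```

With the names from this post, a call might look like fetch_with_fallback(requests.Session(), "http://www.cheapoair.com", proxit, uagen). Passing the functions themselves, rather than their results, lets the helper ask for a new proxy only when it actually needs one.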
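I'm thinking that if I place a variable that updates sp after

    for i in range(1000000000000):

but before while True:, it might work. Another possibility is to declare

    s.proxies = {"http": ""}

and then, if I run into an error, switch to

    s.proxies = {"http": proxit()}

or

    s.proxies = {"http": sp}
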
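
The "keep a variable and swap s.proxies on error" idea could be sketched roughly as follows. The names proxies_for and ProxyState are hypothetical helpers invented for illustration, and the sketch assumes the proxy source is a plain callable such as proxit. Note that the function is called to get a real proxy address; a quoted "proxit()" or "sp" would just be a literal string.

```python
def proxies_for(proxy):
    """Build a requests-style proxies mapping; None or "" means connect directly."""
    if not proxy:
        return {}
    return {"http": proxy, "https": proxy}


class ProxyState:
    """Remembers the current proxy between requests; starts with no proxy at all."""

    def __init__(self, next_proxy):
        # Store the callable itself (e.g. proxit), not the string "proxit()".
        self._next_proxy = next_proxy
        self.current = None

    def mapping(self):
        # Mapping for the proxy currently in use ({} while still direct).
        return proxies_for(self.current)

    def rotate(self):
        # Called after an error: fetch a new proxy and return its mapping.
        self.current = self._next_proxy()
        return self.mapping()
```

In the loop above, this would mean setting s.proxies = state.mapping() before each request and s.proxies = state.rotate() inside the except branch, so that later pages keep using whichever proxy last worked.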
Thanks!