I'm writing a small crawler that should fetch a URL multiple times, and I want all of the threads to run at the same time (simultaneously).
I've written a small piece of code that should do that.
import thread
import time
from urllib2 import Request, urlopen, URLError, HTTPError

def getPAGE(FetchAddress):
    attempts = 0
    while attempts < 2:
        req = Request(FetchAddress, None)
        try:
            response = urlopen(req, timeout=8)  # fetching the url
            print "fetched url %s" % FetchAddress
        except HTTPError, e:
            print 'The server didn\'t fulfill the request.'
            print 'Error code: ', str(e.code) + " address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except URLError, e:
            print 'Failed to reach the server.'
            print 'Reason: ', str(e.reason) + " address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except Exception, e:
            print 'Something bad happened in getPAGE.'
            print 'Reason: ', str(e) + " address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        else:
            try:
                return response.read()
            except:
                print "there was an error with response.read()"
                return None
    return None

url = ("http://www.domain.com",)
for i in range(1, 50):
    thread.start_new_thread(getPAGE, url)
Judging from the Apache logs, the threads don't seem to run simultaneously; there is a small gap between requests. It's almost undetectable, but I can see that the threads are not truly parallel.
I've read about the GIL. Is there a way to work around it without calling C/C++ code? I don't really understand how threading works under the GIL. Does Python basically interpret the next thread as soon as the previous one finishes?
Thanks.