I need to scan a set of given URLs and fetch each one's HTTP status code, e.g. "200 OK", in Python. I am currently using urllib to do this. Is there a faster way?
Python code:

import urllib

def get_status(url):
    try:
        return urllib.urlopen(url).getcode()
    except StandardError:
        return None
A couple of remarks for faster, happier status checking. First, use the HTTP HEAD method: it asks the server for just the headers (including the status code) without having it also serve the body of the page.

Second, urllib works, but I would recommend the wonderful Requests library, which provides a much nicer API for pretty much everything you would want to do with HTTP.

Last, I would use the gevent library to fetch each set of headers asynchronously, vastly speeding up the whole process.
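The first two tips can be sketched with the standard library alone (no Requests or gevent dependency): in Python 3, urllib.request lets you issue a HEAD request via the `method` argument. For the third tip, a `concurrent.futures` thread pool stands in here for gevent, since it ships with Python; the timeout value is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def get_status(url, timeout=5):
    """Return the HTTP status code for url, or None on failure."""
    # HEAD asks the server for headers only, so the body is never sent.
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code   # 4xx/5xx responses still carry a status code
    except URLError:
        return None       # DNS failure, refused connection, etc.

def get_statuses(urls, workers=10):
    """Check many URLs concurrently with a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(get_status, urls)))
```

gevent would let far more checks run concurrently than a thread pool, but the shape of the code is the same: one cheap HEAD request per URL, issued in parallel.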
You probably want to do this in parallel, in a non-blocking way. Check out the eventlet library: http://eventlet.net/. There is a relevant example right on the front page: http://eventlet.net/#web-crawler-example.
For speed, try using GRequests to check the URLs asynchronously (rather than one at a time).
import grequests

urls = [
    'http://www.heroku.com',
    'http://tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://kennethreitz.com'
]

rs = (grequests.get(u) for u in urls)
# For even faster status code checks, use the HEAD method instead of GET:
# rs = (grequests.head(u) for u in urls)
for r in grequests.map(rs):
    print(r.status_code, r.url)
200 http://www.heroku.com/
200 http://tablib.org/
200 http://httpbin.org/
200 http://docs.python-requests.org/en/latest/index.html
200 http://kennethreitz.com/
Yes, there is.

Use threads. Put your code in a Thread subclass, store the results in a shared global object, and spawn a bunch of threads.
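A minimal sketch of that approach, assuming Python 3; the class name, the shared `results` dict, and the timeout are all illustrative choices, and a lock serializes writes from the worker threads.

```python
import threading
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

results = {}                     # shared object: url -> status code (or None)
results_lock = threading.Lock()  # protects results from concurrent writes

class StatusThread(threading.Thread):
    """Fetch one URL's status code and record it in the shared dict."""

    def __init__(self, url):
        super().__init__()
        self.url = url

    def run(self):
        try:
            with urlopen(self.url, timeout=5) as response:
                status = response.status
        except HTTPError as err:
            status = err.code    # error responses still have a status code
        except (URLError, OSError):
            status = None        # unreachable host, timeout, etc.
        with results_lock:
            results[self.url] = status

def check_urls(urls):
    """Spawn one thread per URL, wait for all, return the shared results."""
    threads = [StatusThread(u) for u in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

One thread per URL is fine for small batches; for thousands of URLs you would cap concurrency with a pool (or use gevent/grequests as the other answers suggest).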