
I need to scan a set of given URLs and fetch the HTTP status code, e.g. "200 OK", in Python. I am currently using urllib to do this. Is there a faster way to do it?

Python code

import urllib

def get_status(url):
    try:
        # Python 2: urllib.urlopen and StandardError
        # (on Python 3 this would be urllib.request.urlopen and Exception)
        return urllib.urlopen(url).getcode()
    except StandardError:
        return None

5 Answers


A couple of remarks for faster, happier status checking. The first tip is to use the HTTP HEAD method. This asks the server for just the HTTP headers (including the status code) without having it also serve the body of the page.

Second, urllib works, but I would recommend the wonderful Requests library, which provides a much nicer API for pretty much everything you would want to do with HTTP.

Last, I would use the gevent library to download each set of headers asynchronously, vastly speeding up the whole process.
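A minimal sketch of the first tip, using only the standard library (http.client on Python 3). A throwaway local http.server instance stands in for the real hosts so the snippet is self-contained; the `Handler` and `head_status` names are illustrative, not part of any library:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_HEAD(self):
        # Reply with status + headers only; HEAD never carries a body
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        pass  # silence per-request logging

# Stand-in server on an OS-assigned free port
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

def head_status(host, port, path="/"):
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("HEAD", path)  # headers only, no body transferred
        return conn.getresponse().status
    finally:
        conn.close()

status = head_status("127.0.0.1", server.server_port)
server.shutdown()
```

Against a real host you would call `head_status("example.com", 80)` instead; the point is that the server never transfers the page body, so each check costs one small round trip.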

Answered 2012-07-19T17:00:51.167

You probably want to do this in parallel, in a non-blocking way. Take a look at the eventlet library here: http://eventlet.net/. You can take an example straight from the front page: http://eventlet.net/#web-crawler-example.
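eventlet's green-thread pool is the library's own way to do this (see its web-crawler example). As a standard-library sketch of the same fan-out idea (not eventlet itself), concurrent.futures can map a check over many URLs concurrently; `fetch_status` here is a stub standing in for a real urllib/requests call:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_status(url):
    # Stub: a real version would issue a HEAD request and return its code
    return url, 200

urls = ["http://a.example", "http://b.example", "http://c.example"]

# Fan the checks out over a pool of workers; map preserves input order
with ThreadPoolExecutor(max_workers=10) as pool:
    statuses = dict(pool.map(fetch_status, urls))
```

With eventlet you would use `eventlet.GreenPool().imap(...)` in the same shape; the pool size caps how many requests are in flight at once.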

Answered 2012-07-19T16:58:17.267

For speed, try using GRequests to check the URLs asynchronously (rather than one at a time).

Code

import grequests

urls = [
    'http://www.heroku.com',
    'http://tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://kennethreitz.com'
]

rs = (grequests.get(u) for u in urls)
# For even faster status code checks, use the HEAD method instead of GET
# rs = (grequests.head(u) for u in urls)

for r in grequests.map(rs):
    print r.status_code, r.url

Output

200 http://www.heroku.com/
200 http://tablib.org/
200 http://httpbin.org/
200 http://docs.python-requests.org/en/latest/index.html
200 http://kennethreitz.com/
Answered 2012-07-19T17:13:36.833

Yes, there is.

  1. Use multiple threads to check different URLs at the same time.
  2. Use raw sockets that implement a bare-bones HTTP request. As soon as you receive the 200 response (or any other code), you close the connection, avoiding unnecessary data transfer.
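A sketch of the second point: send a minimal HEAD request over a raw socket, read just the status line, and close. A local http.server instance stands in for the real host so the snippet is self-contained; `raw_status` is an illustrative name, not a library function:

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_HEAD(self):
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        pass  # silence per-request logging

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

def raw_status(host, port, path="/"):
    s = socket.create_connection((host, port), timeout=5)
    try:
        request = ("HEAD {} HTTP/1.1\r\n"
                   "Host: {}\r\n"
                   "Connection: close\r\n\r\n").format(path, host)
        s.sendall(request.encode("ascii"))
        # First line of the reply, e.g. b"HTTP/1.0 200 OK"
        status_line = s.recv(1024).split(b"\r\n", 1)[0]
        return int(status_line.split()[1])
    finally:
        s.close()  # drop the connection as soon as we have the code

code = raw_status("127.0.0.1", server.server_port)
server.shutdown()
```

A production version would loop on `recv` until the full status line arrives, but the idea is the same: only the first few bytes of the response are ever needed.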
Answered 2012-07-19T16:59:16.440

Use threads. Put your code in a Thread subclass, store the results in a shared global object, and start a bunch of threads.
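A minimal sketch of that pattern: each thread runs the check and writes its result into a shared dict guarded by a lock. `get_status` here is a stub standing in for the real urllib call from the question:

```python
import threading

results = {}               # shared result object
lock = threading.Lock()    # guard concurrent writes

def get_status(url):
    # Stub: a real version would fetch the URL and return its status code
    return 200

class StatusThread(threading.Thread):
    def __init__(self, url):
        super().__init__()
        self.url = url

    def run(self):
        code = get_status(self.url)
        with lock:
            results[self.url] = code

threads = [StatusThread(u) for u in ["http://a.example", "http://b.example"]]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for all checks before reading results
```

One thread per URL is fine for small batches; for hundreds of URLs, a bounded pool avoids spawning more threads than the network can usefully keep busy.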

Answered 2012-07-19T16:59:11.867