Is there an easy way to cache things when using urllib2 that I am overlooking, or do I have to roll my own?
7 Answers
If you don't mind working at a slightly lower level, httplib2 (https://github.com/httplib2/httplib2) is an excellent HTTP library that includes caching functionality.
This ActiveState Python recipe might be helpful: http://code.activestate.com/recipes/491261/
You could use a decorator such as:
class cache(object):
    def __init__(self, fun):
        self.fun = fun
        self.cache = {}

    def __call__(self, *args, **kwargs):
        key = str(args) + str(kwargs)
        try:
            return self.cache[key]
        except KeyError:
            self.cache[key] = rval = self.fun(*args, **kwargs)
            return rval
        except TypeError:  # in case key isn't a valid key - don't cache
            return self.fun(*args, **kwargs)
and define a function along the lines of:
import urllib

@cache
def get_url_src(url):
    return urllib.urlopen(url).read()
This assumes you are not concerned with HTTP cache controls and just want to cache the page for the duration of the application.
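On Python 3.2 and later, the standard library's functools.lru_cache gives the same in-process memoization without a hand-written class. A minimal sketch, assuming Python 3; fetch_page and the calls list are hypothetical stand-ins for a real download so that the example runs offline:

```python
import functools

calls = []  # records every real "download" (illustration only)


def fetch_page(url):
    """Hypothetical stand-in for urllib.request.urlopen(url).read()."""
    calls.append(url)
    return ("<html>%s</html>" % url).encode("ascii")


@functools.lru_cache(maxsize=128)
def get_url_src(url):
    # Only runs on a cache miss; repeat calls with the same URL are
    # answered from memory for as long as the entry stays in the cache.
    return fetch_page(url)


first = get_url_src("http://example.com/")
second = get_url_src("http://example.com/")
```

As with the class above, entries never expire by age, so this only suits pages that may go stale for the length of the run.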
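I've always been torn between using httplib2, which does a solid job of handling HTTP caching and authentication, and urllib2, which is in the stdlib, has an extensible interface, and supports HTTP proxy servers.
The ActiveState recipe starts to add caching support to urllib2, but only in a very primitive fashion. It does not allow for extensibility in storage mechanisms, because the file-system-backed storage is hard-coded. It also does not honor HTTP cache headers.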
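To make "honoring HTTP cache headers" concrete, here is a minimal sketch of the freshness decision such a cache has to make. It assumes Python 3 and looks only at the max-age, no-store and no-cache directives of the Cache-Control header; a real HTTP cache also has to consider Expires, Age, validators such as ETag, and more. The function names are hypothetical:

```python
def parse_cache_control(value):
    """Split a Cache-Control header value into a {directive: argument} dict."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        # Directives without an argument (e.g. "private") map to None.
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


def is_fresh(cache_control, age_seconds):
    """Return True if a stored response may be reused without asking the server."""
    directives = parse_cache_control(cache_control)
    if "no-store" in directives or "no-cache" in directives:
        return False
    max_age = directives.get("max-age")
    if max_age is None or not max_age.isdigit():
        return False  # no usable lifetime: conservatively treat as stale
    return age_seconds < int(max_age)
```

For example, is_fresh("public, max-age=3600", 10) is true, while the same header with an age of 4000 seconds is not.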
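In an attempt to bring together the best features of httplib2 caching and urllib2 extensibility, I've adapted the ActiveState recipe to implement most of the same caching functionality as is found in httplib2. The module is in jaraco.net as jaraco.net.http.caching. The link points to the module as it exists at the time of this writing. While that module is currently part of the larger jaraco.net package, it has no intra-package dependencies, so feel free to pull the module out and use it in your own projects.
Alternatively, if you have Python 2.6 or later, you can easy_install jaraco.net>=1.3 and use the CacheHandler with something like the code in caching.quick_test().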
"""Quick test/example of CacheHandler"""
import logging
import urllib2
from httplib2 import FileCache
from jaraco.net.http.caching import CacheHandler
logging.basicConfig(level=logging.DEBUG)
store = FileCache(".cache")
opener = urllib2.build_opener(CacheHandler(store))
urllib2.install_opener(opener)
response = opener.open("http://www.google.com/")
print response.headers
print "Response:", response.read()[:100], '...\n'
response.reload(store)
print response.headers
print "After reload:", response.read()[:100], '...\n'
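Note that jaraco.net.http.caching does not provide a specification for the backing store of the cache, but instead follows the interface used by httplib2. For this reason, httplib2.FileCache can be used directly with urllib2 and the CacheHandler. Other backing caches designed for httplib2 should also be usable by the CacheHandler.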
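As a rough illustration of that interface: httplib2's FileCache exposes get(key), set(key, value) and delete(key), so a backing store only needs those three methods. A minimal in-memory sketch, assuming Python 3; MemoryCache is a hypothetical name and is not part of httplib2 or jaraco.net:

```python
class MemoryCache(object):
    """Dict-backed store with the get/set/delete methods of httplib2.FileCache."""

    def __init__(self):
        self._entries = {}

    def get(self, key):
        # Like FileCache, report a miss by returning None rather than raising.
        return self._entries.get(key)

    def set(self, key, value):
        self._entries[key] = value

    def delete(self, key):
        # Deleting a key that is not cached is not an error.
        self._entries.pop(key, None)


store = MemoryCache()
store.set("http://example.com/", b"cached bytes")
```

Assuming the handler touches its store only through these three methods, an instance like this could stand in for FileCache(".cache") when an on-disk cache is not wanted.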
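I was looking for something similar and came across "Recipe 491261: Caching and throttling for urllib2", which danivo posted. The problem is that I really dislike the caching code (lots of duplication, lots of manually joining file paths instead of using os.path.join, use of staticmethods, not very PEP8-ish, and other things I try to avoid).
The code below is a bit nicer (in my opinion anyway) and is functionally much the same, with a few additions - mainly the "recache" method (example usage can be seen in the if __name__ == "__main__": section at the end of the code).
The latest version can be found at http://github.com/dbr/tvdb_api/blob/master/cache.py, and I'll paste it here for posterity (with my application-specific headers removed):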
#!/usr/bin/env python
"""
urllib2 caching handler
Modified from http://code.activestate.com/recipes/491261/ by dbr
"""
import os
import time
import httplib
import urllib2
import StringIO
from hashlib import md5


def calculate_cache_path(cache_location, url):
    """Returns the paths [cache_location]/[hash_of_url].headers and .body
    """
    thumb = md5(url).hexdigest()
    header = os.path.join(cache_location, thumb + ".headers")
    body = os.path.join(cache_location, thumb + ".body")
    return header, body


def check_cache_time(path, max_age):
    """Checks if a file has been created/modified in the [last max_age] seconds.
    False means the file is too old (or doesn't exist), True means it is
    up-to-date and valid"""
    if not os.path.isfile(path):
        return False
    cache_modified_time = os.stat(path).st_mtime
    time_now = time.time()
    if cache_modified_time < time_now - max_age:
        # Cache is old
        return False
    else:
        return True


def exists_in_cache(cache_location, url, max_age):
    """Returns if header AND body cache file exist (and are up-to-date)"""
    hpath, bpath = calculate_cache_path(cache_location, url)
    if os.path.exists(hpath) and os.path.exists(bpath):
        return (
            check_cache_time(hpath, max_age)
            and check_cache_time(bpath, max_age)
        )
    else:
        # File does not exist
        return False


def store_in_cache(cache_location, url, response):
    """Tries to store response in cache."""
    hpath, bpath = calculate_cache_path(cache_location, url)
    try:
        outf = open(hpath, "w")
        headers = str(response.info())
        outf.write(headers)
        outf.close()

        outf = open(bpath, "w")
        outf.write(response.read())
        outf.close()
    except IOError:
        # Note the inverted result: True means the write failed
        return True
    else:
        return False


class CacheHandler(urllib2.BaseHandler):
    """Stores responses in a persistent on-disk cache.
    If a subsequent GET request is made for the same URL, the stored
    response is returned, saving time, resources and bandwidth
    """
    def __init__(self, cache_location, max_age=21600):
        """The location of the cache directory"""
        self.max_age = max_age
        self.cache_location = cache_location
        if not os.path.exists(self.cache_location):
            os.mkdir(self.cache_location)

    def default_open(self, request):
        """Handles GET requests, if the response is cached it returns it
        """
        # Compare by value: "is not" tests object identity, not string equality
        if request.get_method() != "GET":
            return None  # let the next handler try to handle the request
        if exists_in_cache(
            self.cache_location, request.get_full_url(), self.max_age
        ):
            return CachedResponse(
                self.cache_location,
                request.get_full_url(),
                set_cache_header=True
            )
        else:
            return None

    def http_response(self, request, response):
        """Gets a HTTP response, if it was a GET request and the status code
        starts with 2 (200 OK etc) it caches it and returns a CachedResponse
        """
        if (request.get_method() == "GET"
                and str(response.code).startswith("2")):
            if 'x-local-cache' not in response.info():
                # Response is not cached
                set_cache_header = store_in_cache(
                    self.cache_location,
                    request.get_full_url(),
                    response
                )
            else:
                set_cache_header = True
            # end if x-local-cache in response
            return CachedResponse(
                self.cache_location,
                request.get_full_url(),
                set_cache_header=set_cache_header
            )
        else:
            return response


class CachedResponse(StringIO.StringIO):
    """An urllib2.response-like object for cached responses.
    To determine if a response is cached or coming directly from
    the network, check the x-local-cache header rather than the object type.
    """
    def __init__(self, cache_location, url, set_cache_header=True):
        self.cache_location = cache_location
        hpath, bpath = calculate_cache_path(cache_location, url)
        StringIO.StringIO.__init__(self, file(bpath).read())
        self.url = url
        self.code = 200
        self.msg = "OK"
        headerbuf = file(hpath).read()
        if set_cache_header:
            headerbuf += "x-local-cache: %s\r\n" % (bpath)
        self.headers = httplib.HTTPMessage(StringIO.StringIO(headerbuf))

    def info(self):
        """Returns headers
        """
        return self.headers

    def geturl(self):
        """Returns original URL
        """
        return self.url

    def recache(self):
        """Fetches the URL again and replaces the cached copy"""
        new_request = urllib2.urlopen(self.url)
        set_cache_header = store_in_cache(
            self.cache_location,
            new_request.url,
            new_request
        )
        CachedResponse.__init__(self, self.cache_location, self.url, True)


if __name__ == "__main__":
    def main():
        """Quick test/example of CacheHandler"""
        opener = urllib2.build_opener(CacheHandler("/tmp/"))
        response = opener.open("http://google.com")
        print response.headers
        print "Response:", response.read()
        response.recache()
        print response.headers
        print "After recache:", response.read()
    main()
This article on the Yahoo Developer Network - http://developer.yahoo.com/python/python-caching.html - describes how to cache HTTP calls made through urllib to either memory or disk.
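The general pattern (check the cache, fall back to the network, store the result) can be sketched as follows. This is not the article's code; it is a minimal Python 3 sketch in which cached_fetch, the fetch parameter and the file layout are all assumptions made for illustration:

```python
import hashlib
import os
import tempfile
import time


def cached_fetch(url, fetch, cache_dir, max_age=3600):
    """Return the body for url, reusing an on-disk copy newer than max_age seconds.

    fetch is any callable that takes a URL and returns bytes, for example
    lambda u: urllib.request.urlopen(u).read().
    """
    os.makedirs(cache_dir, exist_ok=True)
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".body"
    path = os.path.join(cache_dir, name)
    if os.path.isfile(path) and time.time() - os.path.getmtime(path) < max_age:
        with open(path, "rb") as cached:
            return cached.read()
    # Cache miss or stale copy: fetch again and overwrite the stored body.
    body = fetch(url)
    with open(path, "wb") as out:
        out.write(body)
    return body


# Offline demonstration with a fake fetcher that counts its calls.
requests_made = []


def fake_fetch(url):
    requests_made.append(url)
    return b"hello from " + url.encode("utf-8")


demo_dir = tempfile.mkdtemp()
first = cached_fetch("http://example.com/", fake_fetch, demo_dir)
second = cached_fetch("http://example.com/", fake_fetch, demo_dir)
```

Passing the fetcher in as a parameter keeps the caching logic independent of urllib and makes it easy to exercise without network access.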
@dbr: you may also want to add caching of https responses by adding this method to CacheHandler:
def https_response(self, request, response):
    return self.http_response(request, response)
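The separate method is needed because urllib2 finds handler hooks by name, looking for methods of the form <protocol>_response, so a handler that defines only http_response is never consulted for https:// URLs. A minimal sketch of the aliasing, assuming Python 3 (where urllib2 became urllib.request); RecordingHandler is a hypothetical class used only to show the idea:

```python
import urllib.request


class RecordingHandler(urllib.request.BaseHandler):
    """Hypothetical handler that notes every URL whose response it sees."""

    def __init__(self):
        self.seen = []

    def http_response(self, request, response):
        self.seen.append(request.get_full_url())
        return response  # pass the response through unchanged

    # Reuse the same hook for HTTPS so both schemes are processed.
    https_response = http_response


handler = RecordingHandler()
request = urllib.request.Request("https://example.com/")
placeholder = object()  # stands in for a real response; no network access
result = handler.https_response(request, placeholder)
```

The class-level alias and the explicit delegating method shown above behave the same way; the alias simply avoids repeating the signature.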