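Store the URLs in a set, which gives O(1) average-case membership lookups, and then shelve it. With this number of URLs, storing and restoring the set takes very little time and memory: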
import shelve
# Write the URLs to a shelf
urls = ['http://www.airmagnet.com/', 'http://www.alcatel-lucent.com/',
        'http://www.ami.com/', 'http://www.apcc.com/', 'http://www.stk.com/',
        'http://www.apani.com/', 'http://www.apple.com/',
        'http://www.arcoide.com/', 'http://www.areca.com.tw/',
        'http://www.argus-systems.com/', 'http://www.ariba.com/',
        'http://www.asus.com.tw/']
s = set(urls)  # Store the URLs as a set: membership tests are O(1) on average
sh = shelve.open('/tmp/shelve.tmp')
sh['urls'] = s  # Dump the set (as one unit) to the shelve file
sh.close()
sh = shelve.open('/tmp/shelve.tmp')  # Reopen the file and retrieve the set
s = sh['urls']
sh.close()
print('http://www.apple.com/' in s)  # True
print('http://matan.name/' in s)  # False
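This approach is very fast. The following benchmark times writing 40,000 random 50-character strings to a shelf: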
import random
import string
import shelve
import datetime
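# Generate 40,000 random 50-character strings to stand in for URLs
urls = [''.join(random.choice(string.ascii_uppercase + string.digits) for x in range(50))
        for i in range(40000)]
s = set(urls)
start = datetime.datetime.now()
sh = shelve.open('/tmp/test.shelve')
sh['urls'] = s  # Store the set, as in the example above
sh.close()  # Close the shelf so the timing includes flushing to disk
end = datetime.datetime.now()
print(end - start)  # Elapsed time for writing the set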
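The benchmark above only times the write. If you also want to check the restore side, something like the following might work. It is a rough sketch for Python 3 (it relies on shelve's context-manager support and time.perf_counter); the helper names random_url and time_round_trip and the file name urls.shelve are made up for illustration, and the shelf is written to a temporary directory rather than /tmp.

```python
import os
import random
import shelve
import string
import tempfile
import time


def random_url(length=50):
    """Return a random string standing in for a URL (hypothetical helper)."""
    alphabet = string.ascii_uppercase + string.digits
    return ''.join(random.choice(alphabet) for _ in range(length))


def time_round_trip(urls, path):
    """Store the URLs as a set in a shelf at `path`, load it back, time both steps."""
    start = time.perf_counter()
    with shelve.open(path) as sh:  # leaving the block closes and flushes the shelf
        sh['urls'] = set(urls)
    write_seconds = time.perf_counter() - start

    start = time.perf_counter()
    with shelve.open(path) as sh:
        restored = sh['urls']  # the whole set is unpickled in one go
    read_seconds = time.perf_counter() - start
    return restored, write_seconds, read_seconds


if __name__ == '__main__':
    urls = [random_url() for _ in range(40000)]
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, 'urls.shelve')
        restored, write_seconds, read_seconds = time_round_trip(urls, path)
    print('write: %.4f s, read: %.4f s' % (write_seconds, read_seconds))
    print(urls[0] in restored)  # membership test against the restored set
```

The actual timings depend on the machine and on which dbm backend shelve picks, so run it on your own data before relying on the numbers.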