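A few suggestions on your code:
When you compile a regular expression pattern, make sure you actually use the compiled object afterwards, and avoid recompiling the pattern on every pass through the processing loop. Compile it once, outside the loop: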
pattern = re.compile('"></a>(.+?)</dd><dt>')
# ...
links = pattern.findall(html)
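If you want to avoid pulling in other frameworks, the best way to speed this up is to use the standard threading library so that several HTTP connections run in parallel.
Something like this: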
from Queue import Queue
from threading import Thread
import urllib2
import re

# Work queue that you push the URLs onto; holds at most 10 pending items
url_queue = Queue(10)
# Compiled once and shared by all workers
pattern = re.compile('"></a>(.+?)</dd><dt>')

def worker():
    '''Gets the next url from the queue and processes it'''
    while True:
        url = url_queue.get()
        try:
            print url
            html = urllib2.urlopen(url).read()
            print html[:10]
            links = pattern.findall(html)
            if links:
                print links
        except Exception as e:
            # Report the failure and keep this worker alive
            print 'failed: %s (%s)' % (url, e)
        finally:
            # Always mark the item done, otherwise join() never returns
            url_queue.task_done()

# Start a pool of 20 workers
for i in xrange(20):
    t = Thread(target=worker)
    t.daemon = True
    t.start()

# Change this to read your links and queue them for processing
for _ in xrange(100):
    url_queue.put("http://www.ravn.co.uk")

# Block until everything is finished.
url_queue.join()
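
The code above is Python 2 (`Queue`, `urllib2`, `xrange`, the `print` statement). On Python 3 the same worker-pool idea might look roughly like the sketch below. The names `crawl`, `fetch_html`, `LINK_PATTERN`, and the `fetch` parameter are made up for this illustration, and decoding every page as UTF-8 is an assumption you may need to change.

```python
import re
import threading
from queue import Queue
from urllib.request import urlopen

# Same pattern as above, compiled once and shared by every worker thread.
LINK_PATTERN = re.compile(r'"></a>(.+?)</dd><dt>')


def fetch_html(url):
    """Download a page and decode it to text (hypothetical helper)."""
    with urlopen(url, timeout=10) as response:
        # Assumption: pages are UTF-8; undecodable bytes are replaced.
        return response.read().decode("utf-8", errors="replace")


def crawl(urls, fetch=fetch_html, num_workers=20, queue_size=10):
    """Fetch every URL on a pool of worker threads; return {url: links}."""
    url_queue = Queue(queue_size)
    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            url = url_queue.get()
            try:
                links = LINK_PATTERN.findall(fetch(url))
                with results_lock:
                    results[url] = links
            except Exception as exc:
                # Report the failure and keep the worker alive.
                print("failed: %s (%s)" % (url, exc))
            finally:
                # Always mark the item done, otherwise join() never returns.
                url_queue.task_done()

    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()

    for url in urls:
        url_queue.put(url)  # blocks while the queue is full

    url_queue.join()  # wait until every queued URL has been processed
    return results
```

You would call it as `crawl(your_list_of_urls)`. Passing `fetch` as a parameter is only there so the download step can be swapped out, for example with a stub when trying the code without network access.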