Here is the problematic code (a very simple crawler). The input file is a list of URLs, usually > 1000 of them.

import socket
from time import time
from urlparse import urlparse

import gevent
from gevent import monkey
# Patch the standard library (sockets in particular) before httplib is
# imported, so that its connections become cooperative.
monkey.patch_all(thread=False)

import httplib
from gevent.pool import Pool

pool = Pool(100)

count = 0
size = 0
failures = 0

global_timeout = 5
socket.setdefaulttimeout(global_timeout)

def process(ourl, mode='GET'):
    """Fetch a single http:// URL and update the global counters."""
    global size, failures, count
    try:
        url = urlparse(ourl)
        start = time()
        conn = httplib.HTTPConnection(url.netloc, timeout=global_timeout)
        # Request the path only; the absolute-URL form is the proxy request
        # syntax, which not every origin server accepts.
        path = url.path or '/'
        if url.query:
            path += '?' + url.query
        conn.request(mode, path)
        res = conn.getresponse()
        body = res.read()
        conn.close()  # release the socket promptly
        took = time() - start
        print mode, ourl, len(body), took
        size += len(body)
        count += 1
    except Exception:
        failures += 1

start = time()

# gevent.core.dns_init() exists only in the old libevent-based gevent
# (0.13.x); gevent 1.0+ initialises its resolver automatically.
gevent.core.dns_init()
print "spawning..."
for url in open('domains'):
    pool.spawn(process, url.rstrip())
print "done...joining..."
pool.join()
print "complete"

end = time()
took = end - start
rate = size / took
print "It took %.2f seconds to process %d urls." % (took, count)
print rate, " bytes/sec"
print rate/1024, " KB/sec"
print rate/1048576, " MB/sec"

print "--- summary ---"
print "total:", count, "failures:", failures

When I change the pool size, I see wildly different speeds:

pool = Pool(100)

I have been considering writing an algorithm to compute the ideal pool size on the fly, but rather than diving in I'd like to know: is there something I'm overlooking?

1 Answer

Any parallel processing will be either CPU-bound or IO-bound. From the nature of your code, it looks like at smaller pool sizes it will be IO-bound. Specifically, it will be limited by the bandwidth of your interface, and possibly by the number of sockets your system can keep open simultaneously (certain versions of Windows come to mind here; I have managed to run out of available sockets more than once). As you increase the pool size, the process may start tending toward CPU-bound (especially if you have more data processing that isn't shown here).

To hold the pool size at an optimal value, you need to monitor the usage of all of these resources (number of open sockets, the bandwidth utilization of your process, CPU utilization, etc.). You can do this manually, by profiling the metrics while the crawler runs and making the necessary adjustments to the pool size, or you can try to automate it. Whether something like that is feasible from within Python is another matter.
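For the manual route, here is a minimal sketch of such a monitor, assuming the third-party psutil library is available (the sampling interval and output format are arbitrary choices; any source of CPU/socket/bandwidth numbers would do): a background greenlet samples the three resources named above, so you can watch which one saturates first as the pool grows.

import gevent
import psutil  # third-party: pip install psutil (an assumption; any metrics source works)

def monitor(interval=2):
    proc = psutil.Process()   # this crawler process
    psutil.cpu_percent()      # prime the counter; the first reading is meaningless
    last_recv = psutil.net_io_counters().bytes_recv
    while True:
        gevent.sleep(interval)
        recv = psutil.net_io_counters().bytes_recv
        print "cpu %5.1f%%  open sockets %4d  recv %8d B/s" % (
            psutil.cpu_percent(),           # system-wide CPU since last call
            len(proc.connections()),        # sockets this process holds open
            (recv - last_recv) / interval)  # rough download bandwidth
        last_recv = recv

gevent.spawn(monitor)  # start it alongside the crawler, before pool.join()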
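For the automated route, the simplest version is an empirical sweep: re-run a small sample batch at several candidate pool sizes and keep the fastest. This is only a rough sketch; the candidate sizes and the 200-URL sample are arbitrary assumptions, and repeated runs are skewed by DNS and server-side caching, so treat the winner as a starting point rather than an optimum.

from time import time
from gevent.pool import Pool

def urls_per_sec(urls, pool_size):
    """Throughput of process() (defined in the question) at one pool size."""
    pool = Pool(pool_size)
    start = time()
    for u in urls:
        pool.spawn(process, u)
    pool.join()
    return len(urls) / (time() - start)

sample = [line.rstrip() for line in open('domains')][:200]  # small warm-up batch
best = max((urls_per_sec(sample, n), n) for n in (25, 50, 100, 200, 400))
print "best pool size: %d (%.1f urls/sec)" % (best[1], best[0])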

Answered 2012-08-15T15:30:27.130