websocket - Queueing in Flask using websockets

Question

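I'm using Flask, Gevent, and scrapy in a project. The basic idea is that you enter a URL and it starts a crawler process with that input as its argument. So far this seems to work well, with the output piped to the browser over a websocket.

I'm curious what the best way is to handle multiple crawlers running simultaneously, for example when two people enter a URL at the same time. I think the best approach would be a queue system; ideally I'd only want a controllable number of crawlers running at once.

Any suggestions on how to solve this with the libraries I'm already using? Or perhaps a different approach altogether?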

score 0 · Accepted Answer

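Try nodejs, webtcp (for websockets), and an asynchronous call for each crawler. Also, once a crawl finishes, you can save the result in temporary storage such as memcached or redis with an expiring key.

That way, when a similar crawl request comes in, you can serve it from the temporary storage.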
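To make the caching idea concrete, here is a minimal Python sketch. It is only an illustration: a plain in-memory dict with expiry times stands in for memcached or redis, and the names `ExpiringCache`, `crawl_with_cache`, and the `crawl` callable are hypothetical, not part of any library.

```python
import time


class ExpiringCache:
    """In-memory stand-in for memcached/redis keys that expire (assumption)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so expiry can be tested without sleeping
        self._items = {}        # url -> (expires_at, result)

    def set(self, url, result):
        self._items[url] = (self.clock() + self.ttl, result)

    def get(self, url):
        entry = self._items.get(url)
        if entry is None:
            return None
        expires_at, result = entry
        if self.clock() >= expires_at:
            # The entry is stale: drop it and report a miss.
            del self._items[url]
            return None
        return result


def crawl_with_cache(url, cache, crawl):
    """Serve a recent result if one exists, otherwise crawl and store it."""
    cached = cache.get(url)
    if cached is not None:
        return cached
    result = crawl(url)         # `crawl` is whatever function runs your crawler
    cache.set(url, result)
    return result
```

With a real redis or memcached server you would let the server handle expiry instead of tracking timestamps yourself; the lookup-then-crawl flow stays the same.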

score 0 · Accepted Answer

If the crawlers are gevent jobs, you can use a pool.

http://www.gevent.org/gevent.pool.html

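Pool, a subclass of Group, provides a way to limit concurrency: if the number of greenlets in the pool has already reached the limit, its spawn method blocks until there is a free slot.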

Pseudocode:

import gevent
from gevent.pool import Pool

# At most 10 crawler greenlets run at the same time.
crawler_pool = Pool(10)

def spawncrawler(url):
    def start():
        # Blocks while the pool is full; `crawl` is your existing crawler function.
        crawler_pool.spawn(crawl, url)

    # Run the (possibly blocking) pool spawn in its own greenlet so the request
    # handler is never held up. If 10 crawlers are already running, this greenlet
    # simply waits until a slot frees up, and the client still gets the default
    # response right away.
    gevent.spawn(start)
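
If you want to convince yourself that the pool really caps concurrency, something like the following sketch should do it. The `fake_crawl` function, the `state` dict, and the limit of 2 are made up for the demo; a real crawler would go in place of the `gevent.sleep` call.

```python
import gevent
from gevent.pool import Pool

MAX_CRAWLERS = 2                      # small limit so the effect is easy to see
pool = Pool(MAX_CRAWLERS)
state = {"running": 0, "peak": 0}     # bookkeeping for the demo only
finished = []


def fake_crawl(url):
    """Hypothetical stand-in for a crawler: records concurrency, then yields."""
    state["running"] += 1
    state["peak"] = max(state["peak"], state["running"])
    gevent.sleep(0.01)                # cooperative yield, like waiting on network I/O
    state["running"] -= 1
    finished.append(url)


def run_demo(urls):
    # One submitter greenlet per URL; each blocks inside pool.spawn while the
    # pool is full, just like `start` in the answer above.
    submitters = [gevent.spawn(pool.spawn, fake_crawl, url) for url in urls]
    gevent.joinall(submitters)        # every crawl has been handed to the pool
    pool.join()                       # every crawl has finished
    return state["peak"]
```

Calling `run_demo` with five URLs should finish all five crawls while the recorded peak never goes above `MAX_CRAWLERS`.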
