python - Python thread communication solution

Question

I am writing a very basic multi-threaded web crawler in Python, and the function that crawls pages and extracts URLs uses a while loop, like this:

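import queue  # this module is named Queue in Python 2

def crawl():
    while True:
        try:
            # Block for up to 10 seconds waiting for a URL from the shared pool.
            p = Page(pool.get(True, 10))
        except queue.Empty:
            continue

        # then extract urls from a page and put new urls into the queue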

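(The full source code is in another question: Multi-threaded Python Web Crawler Got Stuck)

Ideally, I would now like to add a condition to the while loop so that it exits when:

the pool (the Queue object that stores the URLs) is empty, and
all threads are blocked waiting to get a URL from the queue (which means no thread is putting new URLs into the pool, so keeping them waiting is pointless and makes my program hang).

For example, something like: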

# thread_1.attr == 1 means thread_1 is blocking; 0 means it is not blocking

while not (pool.empty() and (thread_1.attr == 1 and thread_2.attr == 1 and ...)):
    # do the crawl stuff

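So I would like to know whether a thread can check what the other active threads are doing, or read their state or the value of one of their attributes.

I have read the official documentation on threading.Event(), but I still cannot figure it out.

I hope someone here can point me in the right direction :)

Thanks a lot!

Marcus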

score 1 · Accepted Answer

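You can try to implement what you want from scratch. A few different solutions come to mind:

Use threading.enumerate() to check whether any threads are still alive.
Try implementing a thread pool that lets you know which threads are still alive and which have returned to the pool. This also helps limit the number of threads crawling a third-party website (for example, see here).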
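
The first option above could look roughly like the following. This is only a minimal sketch: fake_crawl, the "crawler-" thread names, and run_and_wait are hypothetical names made up for illustration, and the sleeping worker stands in for real crawling. Note that threading.enumerate() only reports whether a thread is alive, not whether it is blocked on the queue, so this approach assumes each worker returns when it runs out of work.

```python
import threading
import time

def fake_crawl(seconds):
    # Hypothetical stand-in for a worker that returns once it has no more work.
    time.sleep(seconds)

def alive_crawlers():
    # threading.enumerate() lists every Thread object that is currently alive.
    return [t for t in threading.enumerate() if t.name.startswith("crawler-")]

def run_and_wait(n_workers=3, poll=0.01):
    for i in range(n_workers):
        threading.Thread(
            target=fake_crawl,
            args=(0.05 * (i + 1),),
            name=f"crawler-{i}",
        ).start()
    # Poll until none of the crawler threads is alive any more.
    while alive_crawlers():
        time.sleep(poll)
    return alive_crawlers()
```

Calling run_and_wait() returns only after all three worker threads have finished.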

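If you don't want to reinvent the wheel, you can use an existing library that implements a thread pool, or you can look at gevent, which uses green threads and also provides a thread pool. I have implemented something similar using something like this: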

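from queue import Empty  # the module is named Queue in Python 2

# Assumption: pool is a gevent.pool.Pool, whose free_count() and size are used below.
while True:
    try:
        url = queue.get_nowait()
    except Empty:
        # The queue is empty: stop only once every worker has returned to the pool.
        if pool.free_count() == pool.size:
            break
    ...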
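
For readers who prefer to stay in the standard library, the same exit condition (queue empty and no worker busy) can be approximated with concurrent.futures instead of gevent. This is a sketch under assumptions, not the code from the answer: LINKS is a made-up link graph standing in for real HTTP fetches, and fetch_links and crawl are hypothetical names.

```python
import queue
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

# Hypothetical link graph used instead of real network requests.
LINKS = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["a"]}

def fetch_links(url):
    return LINKS.get(url, [])

def crawl(start, max_workers=4):
    seen = {start}
    pending = set()  # futures for pages that are still being fetched
    todo = queue.Queue()
    todo.put(start)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while True:
            try:
                url = todo.get_nowait()
            except queue.Empty:
                if not pending:
                    break  # queue is empty and no worker is busy: done
                # Otherwise wait for at least one fetch and queue its new links.
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                for future in done:
                    for link in future.result():
                        if link not in seen:
                            seen.add(link)
                            todo.put(link)
                continue
            pending.add(pool.submit(fetch_links, url))
    return seen
```

Because only the main loop adds URLs to the queue here, the check "queue empty and nothing pending" cannot race with a worker that is about to add more work.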

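You can also write a sentinel object to the queue to mark that crawling is complete, then exit the main loop and wait for the threads to finish (for example, using the pool).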

from queue import Empty  # the module is named Queue in Python 2

while True:
    try:
        url = queue.get_nowait()
        # StopIteration marks that no more URLs will be added to the queue.
        if url is StopIteration:
            break
    except Empty:
        continue
    ...
pool.join()
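
With plain threads, the sentinel idea might be sketched as below. Several things here are assumptions rather than part of the answer: a dedicated STOP object is used instead of StopIteration, Queue.join() and task_done() are used to detect that all queued URLs have been processed, and LINKS, worker, and crawl are hypothetical names with a made-up link graph replacing real page fetches.

```python
import queue
import threading

# Hypothetical link graph used instead of real network requests.
LINKS = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["a"]}
STOP = object()  # sentinel: tells a worker that no more URLs will arrive

def worker(pool, seen, lock):
    while True:
        url = pool.get()  # plain blocking get; no timeout is needed
        if url is STOP:
            pool.task_done()
            break
        for link in LINKS.get(url, []):
            with lock:
                if link in seen:
                    continue
                seen.add(link)
            pool.put(link)  # queued before task_done(), so join() cannot return early
        pool.task_done()

def crawl(start, n_workers=3):
    pool = queue.Queue()
    seen = {start}
    lock = threading.Lock()
    pool.put(start)
    threads = [
        threading.Thread(target=worker, args=(pool, seen, lock))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    pool.join()  # returns once every queued URL has been marked done
    for _ in threads:
        pool.put(STOP)  # one sentinel per worker
    for t in threads:
        t.join()
    return seen
```

The main thread never inspects the workers' state. It relies on the queue's count of unfinished tasks, which reaches zero exactly when the queue is empty and no worker is still processing a page.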

Pick whichever you prefer. I hope this helps.

score 0 · Accepted Answer

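Consider looking at this solution: Web crawler Using Twisted. As the answer to that question says, I would also recommend that you take a look at http://scrapy.org/

Multithreading in Python (using threads directly) is nasty, so I would avoid it and use some kind of message passing or reactor-based programming instead.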
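
To give a rough idea of the event-loop style, here is a minimal sketch that uses the standard library's asyncio in place of Twisted or Scrapy (that substitution is mine, not the answer's). LINKS, fetch_links, worker, and crawl are hypothetical names, and the link graph stands in for real non-blocking HTTP requests.

```python
import asyncio

# Hypothetical link graph used instead of real network requests.
LINKS = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["a"]}

async def fetch_links(url):
    await asyncio.sleep(0)  # stand-in for non-blocking network I/O
    return LINKS.get(url, [])

async def worker(todo, seen):
    while True:
        url = await todo.get()
        try:
            for link in await fetch_links(url):
                if link not in seen:
                    seen.add(link)
                    todo.put_nowait(link)
        finally:
            todo.task_done()

async def crawl(start, n_workers=3):
    todo = asyncio.Queue()
    seen = {start}
    todo.put_nowait(start)
    workers = [asyncio.create_task(worker(todo, seen)) for _ in range(n_workers)]
    await todo.join()  # every queued URL has been processed
    for w in workers:
        w.cancel()  # workers are idle in todo.get(), so cancelling is safe
    await asyncio.gather(*workers, return_exceptions=True)
    return seen
```

Everything runs on one thread, so the shared seen set needs no lock; a call such as asyncio.run(crawl("a")) drives the whole crawl.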
