python - How can I parallelize I/O-bound operations in Python?

Question

I am processing tweets with tweepy:

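import json

# tweepy 3.x API (StreamListener was removed in tweepy 4)
from tweepy import Stream
from tweepy.streaming import StreamListener

class StdOutListener(StreamListener):
    def on_data(self, data):
        process(json.loads(data))
        return True

l = StdOutListener()
stream = Stream(auth, l)
stream.filter(track=utf_words)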

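The process function fetches the content of the URLs contained in the tweet (with requests), processes the data with nltk (which I guess uses a bit of CPU), and saves the results to Mongo.

The problem is that fetching the content of the URLs takes a long time, which limits my processing speed. How can I speed this up in a Pythonic way?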

score 3 · Accepted Answer

You can use Python's threading module:

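import threading
from functools import reduce  # reduce is a builtin on Python 2

class YourThreadSubclass(threading.Thread):
    def __init__(self, your_args):
        threading.Thread.__init__(self)
        # do whatever setup you want
        self.some_property = your_args
        self.result_field = None

    def run(self):
        # runs in the new thread once start() is called
        self.result_field = process_data(self.some_property)

def process_all(iterable):
    threads = [YourThreadSubclass(args) for args in iterable]
    for t in threads:
        t.start()
    # wait for every thread to finish before combining the results
    for t in threads:
        t.join()
    return reduce(combiner, (t.result_field for t in threads))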

More information here: http://docs.python.org/2/library/threading.html

Edit: More directly, you can fork a thread each time on_data is called.

def on_data(self, data):
    # hand the tweet to a new thread so the stream is not blocked
    YourThreadSubclass(data).start()
    return True

The forked thread will store its results asynchronously.

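If you are handling a large number of requests, you may also want to use a thread pool to manage your threads. See the documentation for details.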
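The thread pool idea could look roughly like the sketch below. It uses concurrent.futures.ThreadPoolExecutor from the standard library (Python 3.2+), which the answer itself does not name. The PooledListener class, its max_workers default, and the process stub are hypothetical stand-ins for illustration; in real code the listener would subclass tweepy's StreamListener and process would be the asker's own function.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor


def process(tweet):
    """Hypothetical stand-in for the asker's process() function."""
    time.sleep(0.05)  # placeholder for the slow, blocking URL fetch
    return len(tweet["text"])


class PooledListener:
    """Sketch only: not a tweepy class, just the on_data shape from the question."""

    def __init__(self, max_workers=8):
        # A fixed number of worker threads is reused for every tweet,
        # instead of starting one new thread per tweet.
        self.pool = ThreadPoolExecutor(max_workers=max_workers)
        # Futures are kept here only so the results can be inspected;
        # a long-running stream would not want this list to grow forever.
        self.futures = []

    def on_data(self, data):
        # submit() returns immediately; process() runs on a pool thread.
        self.futures.append(self.pool.submit(process, json.loads(data)))
        return True

    def close(self):
        # Block until every queued tweet has been processed.
        self.pool.shutdown(wait=True)
```

Compared with one thread per tweet, the pool caps the number of concurrent fetches, so a burst of tweets queues up rather than spawning an unbounded number of threads.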
