python - Python threading in a loop

Question

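I have a project that needs to take a bunch of large matrices, stored in ~200 MB files, and cross-correlate them with each other (i.e. FFT * conj(FFT)). There are too many files to load them all up front and then process them. On the other hand, reading each file in as it is needed is slower than I'd like.

What I have so far is something like this: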

result=0
for i in xrange(N_files):
    f1 = file_reader(file_list[i])

    ############################################################################
    # here I want to have file_reader go start reading the next file I'll need #
    ############################################################################

    in_place_processing(f1)
    for j in xrange(i+1,N_files):
        f2 = file_reader(file_list[j])

        ##################################################################
        # here I want to have file_reader go start reading the next file #
        ##################################################################

        in_place_processing(f2)
        result += processing_function(f1,f2)

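So basically, I just want two threads, each of which reads a file, hands it to me when I ask for it (or as soon as it is finished after I ask), and then starts reading the next file so that it is ready by the time I ask for that one. The object returned by file_reader is rather large and complicated, so I'm not sure multiprocessing is the way to go here...

I've read about threads and queues, but I can't figure out the part where I ask the thread to go read a file and can carry on with the program while it does so. I don't want the threads to simply go about their business in the background. Am I missing a detail here, or is threading not the way to go?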

score 0 · Accepted Answer

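Below is an example using the multiprocessing module, which spawns a child process to call your file_reader function and queue up its results. The queue blocks when it is full, so you can control how many files are read ahead with the QUEUE_SIZE constant.

This uses the standard producer/consumer model of inter-process communication, with the child process acting as the producer and the main thread as the consumer. The join call in the class destructor ensures that the child process's resources are cleaned up properly. A few print statements are sprinkled in for demonstration purposes.

Additionally, for comparison, I added the ability for the QueuedFileReader class to offload the work to a worker thread, or to run it in the main thread, instead of using a child process. This is done by passing MODE_THREADS or MODE_SYNCHRONOUS, respectively, as the mode argument when the class is initialized.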

import multiprocessing as mp
import Queue
import threading
import time

QUEUE_SIZE = 2 #buffer size of queue

## Placeholder for your functions and variables
N_files = 10
file_list = ['file %d' % i for i in range(N_files)]

def file_reader(filename):
    time.sleep(.1)
    result = (filename,'processed')
    return result

def in_place_processing(f):
    time.sleep(.2)

def processing_function(f1,f2):
    print f1, f2
    return id(f1) & id(f2)

MODE_SYNCHRONOUS = 0  #file_reader called in main thread synchronously
MODE_THREADS = 1      #file_reader executed in worker thread
MODE_PROCESS = 2      #file_reader executed in child_process
##################################################
## Class to encapsulate multiprocessing objects.
class QueuedFileReader():
    def __init__(self, idlist, mode=MODE_PROCESS):
        self.mode = mode
        self.idlist = idlist
        if mode == MODE_PROCESS:
            self.queue = mp.Queue(QUEUE_SIZE)
            self.process = mp.Process(target=QueuedFileReader.worker,
                                      args=(self.queue,idlist))
            self.process.start()
        elif mode == MODE_THREADS:
            self.queue = Queue.Queue(QUEUE_SIZE)
            self.thread = threading.Thread(target=QueuedFileReader.worker,
                                           args=(self.queue,idlist))
            self.thread.start()

    @staticmethod
    def worker(queue, idlist):
        for i in idlist:
            queue.put((i, file_reader(file_list[i])))
            print id(queue), 'queued', file_list[i]
        queue.put('done')

    def __iter__(self):
        if self.mode == MODE_SYNCHRONOUS:
            self.index = 0
        return self

    def next(self):
        if self.mode == MODE_SYNCHRONOUS:
            if self.index == len(self.idlist): raise StopIteration
            q = (self.idlist[self.index],
                 file_reader(file_list[self.idlist[self.index]]))
            self.index += 1
        else:
            q = self.queue.get()
            if q == 'done': raise StopIteration
        return q

    def __del__(self):
        if self.mode == MODE_PROCESS:
            self.process.join()
        elif self.mode == MODE_THREADS:
            self.thread.join()

#mode = MODE_PROCESS
mode = MODE_THREADS
#mode = MODE_SYNCHRONOUS
result = 0
for i, f1 in QueuedFileReader(range(N_files),mode):

    in_place_processing(f1)

    for j, f2 in QueuedFileReader(range(i+1,N_files),mode):
        in_place_processing(f2)
        result += processing_function(f1,f2)
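For readers on Python 3, the thread-based mode above can be sketched as a small generator built on a bounded queue. This is only a sketch under stated assumptions: prefetch, depth and cross_correlate_all are names made up for illustration, and file_reader, in_place_processing and processing_function are toy stand-ins for the question's functions. A thread tends to help here only when file_reader spends most of its time in blocking I/O, which releases the GIL in CPython.

```python
import queue
import threading
import time


def prefetch(read, names, depth=2):
    """Yield read(name) for each name, reading up to `depth` items
    ahead in a worker thread (hypothetical helper, not a library API)."""
    buf = queue.Queue(maxsize=depth)

    def producer():
        try:
            for name in names:
                buf.put((True, read(name)))   # blocks while the buffer is full
        except Exception as exc:
            buf.put((False, exc))             # hand the error to the consumer
        else:
            buf.put(None)                     # end-of-stream marker

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()                      # blocks until the next item is ready
        if item is None:
            return
        ok, value = item
        if not ok:
            raise value
        yield value


# Toy stand-ins for the question's functions (assumptions, not the real ones).
def file_reader(name):
    time.sleep(0.01)                          # pretend to do slow I/O
    return [len(name)]


def in_place_processing(data):
    data[0] *= 2


def processing_function(a, b):
    return a[0] * b[0]


file_list = ["a", "bb", "ccc", "dddd"]


def cross_correlate_all(names):
    result = 0
    for i, f1 in enumerate(prefetch(file_reader, names)):
        in_place_processing(f1)
        # The inner reader starts loading files i+1, i+2, ... right away.
        for f2 in prefetch(file_reader, names[i + 1:]):
            in_place_processing(f2)
            result += processing_function(f1, f2)
    return result
```

The depth argument plays the role of QUEUE_SIZE: it bounds how many files are held in memory ahead of the consumer. If the consumer stops iterating early, the worker stays blocked on a full queue; it is marked as a daemon thread so that it does not keep the interpreter alive.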

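If your intermediate values are too large to pass through the queue, you can run each iteration of the outer loop in its own process. A handy way to do that is the Pool class in multiprocessing, as in the example below.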

import multiprocessing as mp
import time

## Placeholder for your functions and variables
N_files = 10
file_list = ['file %d' % i for i in range(N_files)]

def file_reader(filename):
    time.sleep(.1)
    result = (filename,'processed')
    return result

def in_place_processing(f):
    time.sleep(.2)

def processing_function(f1,f2):
    print f1, f2
    return id(f1) & id(f2)

def file_task(file_index):
    print file_index
    f1 = file_reader(file_list[file_index])
    in_place_processing(f1)
    task_result = 0
    for j in range(file_index+1, N_files):
        f2 = file_reader(file_list[j])
        in_place_processing(f2)
        task_result += processing_function(f1,f2)
    return task_result



pool = mp.Pool(processes=None) #processes default to mp.cpu_count()
result = 0
for file_result in pool.map(file_task, range(N_files)):
    result += file_result
print 'result', result

#or simply
#result = sum(pool.map(file_task, range(N_files)))
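The same per-iteration split can be sketched in Python 3 with concurrent.futures. This is an illustration under assumptions rather than a drop-in replacement: total_result and its executor_cls parameter are hypothetical names, and the three helper functions are toy stand-ins chosen so that the total is easy to check by hand. Making the executor class a parameter mirrors the mode switch in the first example, so the same task can be tried with threads or with processes.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# Toy stand-ins for the question's data and functions (assumptions).
file_list = ["a", "bb", "ccc", "dddd"]


def file_reader(name):
    return [len(name)]


def in_place_processing(data):
    data[0] *= 2


def processing_function(a, b):
    return a[0] * b[0]


def file_task(i):
    """One outer-loop iteration: pair file i with every later file."""
    f1 = file_reader(file_list[i])
    in_place_processing(f1)
    total = 0
    for name in file_list[i + 1:]:
        f2 = file_reader(name)
        in_place_processing(f2)
        total += processing_function(f1, f2)
    return total                  # only this small number leaves the worker


def total_result(executor_cls=ProcessPoolExecutor, workers=None):
    """Run every outer iteration as its own task and add up the results."""
    with executor_cls(max_workers=workers) as pool:
        return sum(pool.map(file_task, range(len(file_list))))
```

With the default ProcessPoolExecutor, total_result() should be called from under an `if __name__ == "__main__":` guard, since some platforms start workers by importing the main module again. Each worker reads its own copies of the files, so the total number of reads is unchanged and peak memory grows with the number of workers. Tasks with a small index also do more work than later ones, so the load is uneven.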
