python - Multithreading in Python using a queue

Question

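I am very new to Python, and I need to implement multithreading in my code.

I have a huge .csv file (a million lines) as my input. I read it line by line, make a REST request for each line, do some processing on each line, and write the output to another file. The order of lines in the input/output files matters. Right now I am doing this one line at a time. I want to run the same code in parallel, i.e. read 20 lines of input from the .csv file and make the REST calls in parallel, so that my program is faster.

I have been reading http://docs.python.org/2/library/queue.html, but I also read about the Python GIL issue, which says that code will not run any faster even after multithreading. Is there any other simple way to achieve multithreading?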

score 2 · Accepted Answer

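Can you split the .csv file into multiple smaller files? If you can, then you could have another program run multiple instances of your processor.

Say the files are all named file1, file2, etc., and your processor takes the file name as an argument. You could have: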

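import subprocess
import os
import signal

numfiles = 4  # assumed example value: number of split files (file1 .. file4)

processes = []
for i in range(1, numfiles + 1):  # + 1 so the last file is included
    # the command and its arguments must be passed as a single list
    program = subprocess.Popen(['python', 'processer.py', 'file' + str(i)])
    processes.append(program)

# if you need to kill a process:
# os.kill(processes[0].pid, signal.SIGINT)

# wait for all processors to finish
for program in processes:
    program.wait()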
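
The loop above assumes the input has already been split. A minimal sketch of that step, assuming Python 3; `split_file` and `merge_files` are hypothetical helper names, the chunk size is up to you, and the sketch treats the file as plain lines (it does not handle a header row or quoted fields that contain newlines):

```python
import itertools
import os


def split_file(path, lines_per_chunk, out_dir):
    """Split `path` into out_dir/file1, file2, ... and return the chunk paths in order."""
    chunk_paths = []
    with open(path, "r", newline="") as src:
        for index in itertools.count(1):
            # take the next `lines_per_chunk` lines without loading the whole file
            chunk = list(itertools.islice(src, lines_per_chunk))
            if not chunk:
                break
            chunk_path = os.path.join(out_dir, "file" + str(index))
            with open(chunk_path, "w", newline="") as dst:
                dst.writelines(chunk)
            chunk_paths.append(chunk_path)
    return chunk_paths


def merge_files(paths, out_path):
    """Concatenate per-chunk output files in the given order."""
    with open(out_path, "w", newline="") as dst:
        for p in paths:
            with open(p, "r", newline="") as src:
                for line in src:
                    dst.write(line)
```

Since the line order matters, each processor would write its own output file, and `merge_files` would then join those outputs in chunk order once every process has finished.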

score 1 · Accepted Answer

Python releases the GIL on I/O. If most of the time is spent making REST requests, you can use threads to speed up processing:

try:
    from gevent.pool import Pool # $ pip install gevent
    import gevent.monkey; gevent.monkey.patch_all() # patch stdlib
except ImportError: # fallback on using threads
    from multiprocessing.dummy import Pool

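import urllib2  # Python 2; on Python 3 use urllib.request instead

def process_line(url):
    # fetch one URL; return (body, None) on success or (None, error) on failure
    try:
        # strip the trailing newline that comes with each line of the file
        return urllib2.urlopen(url.strip()).read(), None
    except EnvironmentError as e:
        return None, e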

with open('input.csv', 'rb') as file, open('output.txt', 'wb') as outfile:
    pool = Pool(20) # use 20 concurrent connections
    for result, error in pool.imap_unordered(process_line, file):
        if error is None:
            outfile.write(result)

If the input/output order must be the same, you can use imap instead of imap_unordered.
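
To illustrate the ordering guarantee, here is a minimal sketch assuming Python 3. `fake_request` and `run_ordered` are hypothetical names; `fake_request` sleeps instead of calling a real REST endpoint, so that later items finish first:

```python
import time
from multiprocessing.dummy import Pool  # thread-based pool with the multiprocessing API


def fake_request(item):
    """Stand-in for a REST call: sleeps for `delay` seconds, then returns the index."""
    index, delay = item
    time.sleep(delay)
    return index


def run_ordered(items, workers=4):
    """Process items concurrently and return the results in input order."""
    with Pool(workers) as pool:
        # imap yields results in the order of the inputs, not in completion order
        return list(pool.imap(fake_request, items))


if __name__ == "__main__":
    # the last item finishes first, yet it still comes out last
    items = [(0, 0.3), (1, 0.2), (2, 0.1), (3, 0.0)]
    print(run_ordered(items))  # [0, 1, 2, 3]
```

With imap_unordered the same call would return the indices in whatever order the requests happened to complete.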

If your program is CPU-bound, you can use multiprocessing.Pool() instead, which creates multiple processes.
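
A minimal sketch of the CPU-bound case, assuming Python 3. `count_primes` is a hypothetical stand-in for whatever heavy per-line computation you have:

```python
from multiprocessing import Pool


def count_primes(limit):
    """CPU-bound stand-in: count the primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


if __name__ == "__main__":
    limits = [10000, 20000, 30000, 40000]
    # each worker is a separate process with its own interpreter and its own GIL
    with Pool(4) as pool:
        results = pool.map(count_primes, limits)  # results keep the input order
    for limit, result in zip(limits, results):
        print(limit, result)
```

The worker function has to live at module level so it can be pickled and sent to the worker processes, and the `if __name__ == "__main__":` guard is needed on platforms that start workers by re-importing the main module.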

See also: Python Interpreter blocks Multithreaded DNS requests?

This answer shows how to create a thread pool manually using the threading + Queue modules.
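
For reference, a minimal sketch of what such a manual thread pool might look like, assuming Python 3 (where the module is named `queue` rather than `Queue`). It is not the code from the linked answer; `process_line`, `worker`, and `process_in_order` are hypothetical names, and `process_line` is a placeholder for the real request and processing:

```python
import queue  # named Queue on Python 2
import threading


def process_line(line):
    """Hypothetical per-line work; replace with the real request and processing."""
    return line.strip().upper()


def worker(tasks, results):
    """Pull (index, line) pairs from the queue until a None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:
            break
        index, line = item
        # each task writes to its own slot, so the input order is kept
        results[index] = process_line(line)


def process_in_order(lines, num_threads=20):
    """Process lines with a fixed pool of threads, keeping the input order."""
    lines = list(lines)
    results = [None] * len(lines)
    tasks = queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for item in enumerate(lines):
        tasks.put(item)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker so every thread exits
    for t in threads:
        t.join()
    return results
```

This sketch holds all lines and results in memory, so for a file with a million lines you would likely feed it one batch at a time and write each batch out before reading the next.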
