python-3.x - Using multiprocessing to parallelize adding list items to a dict

Question

I have a large list of strings. I want to build a dictionary from this list such that:

list = [str1, str2, str3, ....]

dict = {str1:len(str1), str2:len(str2), str3:len(str3),.....}

My solution is a for loop, but it takes too long (my list contains nearly 1M elements):

d = {}
for i in list:
    d[i] = len(i)

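I would like to use the multiprocessing module in Python to take advantage of all cores and reduce the time the process needs to run. I came across some rough examples involving the manager module to share a dict between different processes, but I was unable to get them working. Any help would be appreciated!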

score 1 · Accepted Answer

I don't know whether using multiple processes will be faster, but it's an interesting experiment.

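General flow:

Create a list of random words
Split the list into segments, one segment per process
Run the processes, passing each one its segment as an argument
Merge the resulting dictionaries into a single dictionary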
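
The split and merge steps can be tried on their own, without any processes, to see that nothing is lost along the way. This is only a minimal sketch: the helper names `split_into_segments` and `merge_dicts` and the five sample words are made up for illustration, and an ordinary dict comprehension stands in for the work each process would do.

```python
def split_into_segments(items, n_segments):
    """Step 2: cut items into at most n_segments contiguous slices."""
    size = max(1, -(-len(items) // n_segments))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def merge_dicts(partials):
    """Step 4: fold the per-segment dictionaries into one."""
    merged = {}
    for partial in partials:
        merged.update(partial)
    return merged


words = ['apple', 'fig', 'kiwi', 'plum', 'pear']
segments = split_into_segments(words, 2)
# Stand-in for step 3: each segment would normally go to its own process.
partials = [{w: len(w) for w in seg} for seg in segments]
result = merge_dicts(partials)

print(segments)
print(result)
```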

Try this code:

import concurrent.futures
import random
from multiprocessing import freeze_support


def todict(lst):
    """Map each word in lst to its length (runs in a worker process)."""
    print(f'Processing {len(lst)} words')
    return {e: len(e) for e in lst}  # convert list to dictionary


if __name__ == '__main__':
    freeze_support()  # needed when frozen into a Windows executable; harmless otherwise

    # create random word list - max 15 chars
    letters = [chr(x) for x in range(65, 65 + 26)]  # A-Z
    words = [''.join(random.sample(letters, random.randint(1, 15))) for _ in range(10000)]  # 10000 words

    words = list(set(words))  # remove dups, count will drop

    print(len(words))

    ########################

    cpucnt = 4  # process count to use

    # split word list for each process
    wl = len(words) // cpucnt + 1  # word count per process
    lstsplit = [words[c * wl:(c + 1) * wl] for c in range(cpucnt)]  # one word list per process

    # start processes
    with concurrent.futures.ProcessPoolExecutor(max_workers=cpucnt) as executor:
        procs = [executor.submit(todict, lst) for lst in lstsplit]
        results = [p.result() for p in procs]  # block until results are gathered

    # merge results to single dictionary
    dd = {}
    for r in results:
        dd.update(r)

    print(len(dd))  # confirm match word count
    with open('dd.txt', 'w') as f:
        f.write(str(dd))  # write dictionary to text file
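
To find out whether multiple processes actually help, the two approaches can be timed side by side. The sketch below is a different variant from the code above, not a replacement for it: it lets `multiprocessing.Pool.map` do the splitting through its `chunksize` argument instead of slicing the list by hand. The function name `lengths_with_pool`, the 200,000-word list, and the chunk size of 10,000 are assumptions chosen for illustration. Because `len` is so cheap, the cost of sending strings to the workers and back may well outweigh any gain, so measure on your own machine and data.

```python
import random
import string
import time
from multiprocessing import Pool


def lengths_with_pool(items, processes=4, chunksize=10_000):
    """Compute len() of every item in worker processes and zip the results back."""
    with Pool(processes=processes) as pool:
        # pool.map returns results in the same order as the input
        lengths = pool.map(len, items, chunksize=chunksize)
    return dict(zip(items, lengths))


if __name__ == '__main__':
    # Hypothetical test data: random upper-case words of 1 to 15 characters.
    words = [''.join(random.choices(string.ascii_uppercase, k=random.randint(1, 15)))
             for _ in range(200_000)]

    start = time.perf_counter()
    serial = {w: len(w) for w in words}
    print(f'single process: {time.perf_counter() - start:.3f} s')

    start = time.perf_counter()
    parallel = lengths_with_pool(words)
    print(f'process pool:   {time.perf_counter() - start:.3f} s')

    assert serial == parallel  # both approaches must build the same dict
```

The timings are printed rather than assumed. If the single-process line comes out faster, a plain dict comprehension is the better choice for this task.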
