I have a Python script that would take roughly 93 days to complete on 1 CPU, or 1.5 days on 64 CPUs.

I have a large file (FOO.sdf) and want to extract the "entries" in FOO.sdf that match a pattern. An "entry" is a block of ~150 lines delimited by "$$$$". The desired output is 600K blocks of ~150 lines each. The script I have right now is shown below. Is there a way to use multiprocessing or threading to distribute this task across multiple cores/CPUs/threads? I have access to a server with 64 cores.
    name_list = []
    c = 0
    # Titles of the text blocks I want to extract (of the form [..., '25163208', ...])
    with open('Names.txt', 'r') as names:
        for name in names:
            name_list.append(name.strip())
    # Write the matching text blocks to this file
    with open("subset.sdf", 'w') as subset:
        # Open the large file containing many text blocks I don't want
        with open("FOO.sdf", 'r') as f:
            # Loop through each line in the file
            for line in f:
                # Avoids appending extraneous lines or choking
                if line.split() == []:
                    continue
                # This checks whether the line matches any name in name_list.
                # Since I expect that check to be expensive, I only want it
                # to run if the line passes the first two (cheap) conditions.
                if ("-" not in line.split()[0]) and (len(line.split()[0]) >= 5) and (line.split()[0] in name_list):
                    c = 1  # c == 1 designates that lines should be written
                # Write this line to the output file
                if c == 1:
                    subset.write(line)
                    # Stop writing once we see the "$$$$" entry terminator
                    if line.split()[0] == "$$$$":
                        c = 0
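As a point of reference before parallelizing: the `token in name_list` membership test scans the whole list on every candidate line, which may dominate the runtime with ~600K names. Below is a minimal single-core sketch of the same state machine with the names held in a `set` (constant-time lookups). The function name `extract_entries` is hypothetical; it assumes, as above, that a wanted entry starts at a line whose first token matches a name and ends at a `$$$$` line.

```python
def extract_entries(names_path, input_path, output_path):
    """Extract $$$$-delimited entries whose first token matches a name.

    Same logic as the script above, but names are stored in a set so each
    membership test is O(1) instead of a scan over the whole name list.
    """
    with open(names_path) as names:
        name_set = {line.strip() for line in names}

    writing = False  # True while inside an entry we want to keep
    with open(input_path) as f, open(output_path, "w") as subset:
        for line in f:
            fields = line.split()
            if not fields:  # skip blank lines, as in the original
                continue
            token = fields[0]
            # Cheap filters first, then the (now cheap) set lookup
            if not writing and "-" not in token and len(token) >= 5 \
                    and token in name_set:
                writing = True  # start of a wanted entry
            if writing:
                subset.write(line)
                if token == "$$$$":  # entry terminator
                    writing = False
```

If the set lookup alone closes most of the gap, multiprocessing may not be needed; otherwise the file can still be split on `$$$$` boundaries and the chunks farmed out to worker processes.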