I have a Python script that would take roughly 93 days to complete on 1 CPU, or 1.5 days on 64 CPUs.

I have a large file (FOO.sdf) and want to extract the "entries" in FOO.sdf that match a pattern. An "entry" is a block of ~150 lines delimited by "$$$$". The desired output is 600K blocks of ~150 lines each. The script I have right now is shown below. Is there a way to use multiprocessing or threading to distribute this task across multiple cores/CPUs/threads? I have access to a server with 64 cores.
    name_list = []
    c = 0
    # Titles of the text blocks I want to extract (of the form [..., '25163208', ...])
    with open('Names.txt', 'r') as names:
        for name in names:
            name_list.append(name.strip())
    # Write the matching text blocks to this file
    with open("subset.sdf", 'w') as subset:
        # Open the large file containing many text blocks I don't want
        with open("FOO.sdf", 'r') as f:
            # Loop through each line in the file
            for line in f:
                # Avoids appending extraneous lines or choking
                if line.split() == []:
                    continue
                # This checks whether the line matches any name in name_list.
                # Since I expect that check to be expensive, I only want it
                # to run if the line passes the first two (cheap) conditions.
                if ("-" not in line.split()[0]) and (len(line.split()[0]) >= 5) and (line.split()[0] in name_list):
                    c = 1  # c == 1 designates that lines should be written
                # Write this line to the output file
                if c == 1:
                    subset.write(line)
                    # Stop writing once we see the "$$$$" entry terminator
                    if line.split()[0] == "$$$$":
                        c = 0
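As a point of reference before parallelizing: the `token in name_list` membership test scans the whole list on every candidate line, which may dominate the runtime with ~600K names. Below is a minimal single-core sketch of the same state machine with the names held in a `set` (constant-time lookups). The function name `extract_entries` is hypothetical; it assumes, as above, that a wanted entry starts at a line whose first token matches a name and ends at a `$$$$` line.

```python
def extract_entries(names_path, input_path, output_path):
    """Extract $$$$-delimited entries whose first token matches a name.

    Same logic as the script above, but names are stored in a set so each
    membership test is O(1) instead of a scan over the whole name list.
    """
    with open(names_path) as names:
        name_set = {line.strip() for line in names}

    writing = False  # True while inside an entry we want to keep
    with open(input_path) as f, open(output_path, "w") as subset:
        for line in f:
            fields = line.split()
            if not fields:  # skip blank lines, as in the original
                continue
            token = fields[0]
            # Cheap filters first, then the (now cheap) set lookup
            if not writing and "-" not in token and len(token) >= 5 \
                    and token in name_set:
                writing = True  # start of a wanted entry
            if writing:
                subset.write(line)
                if token == "$$$$":  # entry terminator
                    writing = False
```

If the set lookup alone closes most of the gap, multiprocessing may not be needed; otherwise the file can still be split on `$$$$` boundaries and the chunks farmed out to worker processes.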