python - Python search script faster than its C equivalent

Question

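For my research I need to process huge text files (10 GB) of biological sequences in FASTA format; more precisely, I have to extract the sequences that have specific IDs. A FASTA record looks like this:

>id|id_number (e.g. 102574)|stuff

ATGCGAT.. ATGTC.. (multiple lines)

So I wrote a script that searches chunks of these big files, so that I can parallelize the search with Python's multiprocessing library (and use my 8 CPUs).

The function I feed to my multiprocessing class is the following: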

    idlist=inP[0]   # list of wanted ids
    filpath=inP[1]  # chunk of the big file
    idproc=inP[2]   # id of the process

    #######################
    fil=filpath.split('\n')
    del filpath
    f=open('seqwithid{0}'.format(idproc),'w')
    def lineiter():
        for line in fil:
            yield line
    it=lineiter()
    line=it.next()

    while 1:
        try:
            ids=line.split('|')[1].split('locus')[0].partition('ref')[0]
            #print ids
            while ids[0].isalpha():
                ids=ids[1:]
        except Exception:
            pass
        else:
            if ids in idlist: 
                f.write(line+'\n')
                while 1:
                    try:
                        line=it.next()
                    except Exception:
                        break
                    if line and line[0]!='>':
                        f.write(line+'\n')
                    else:
                        break
        try:                
            line=it.next()
        except Exception:
            break
        while  not line or line[0]!='>':
            try:
                line=it.next()
            except Exception:
                break
    f.close()

To improve speed, I rewrote this code in C using 4 functions.

First I cut the file into chunks:

f1=fopen(adr, "r");
if (f1==0){printf("wrong sequences file: %s\n",adr);exit(1);}

fstream = (char *) malloc((end-begin)*sizeof(char) );
fseek(f1,begin,SEEK_CUR);
fread(fstream,sizeof(char)*(end-begin-1),1,f1);
adrtampon=fgetc(f1);

while (!(feof(f1)) && adrtampon!=ter)
{
    sprintf(fstream,"%s%c",fstream,adrtampon);
    adrtampon=fgetc(f1);
}
fclose(f1);

Then, in the main function, I run through the chunk until I find a '>' character:

adrtampon=fstream[0];   
i=0;

while(adrtampon!='\0' )
{
    adrtampon=fstream[i];
    if (adrtampon==ter)
    {
        sprintf(id,"%s",seekid((fstream+i)));

        if (checkidlist(id,tab,size)==0) 
        {
            i++;
            fputc('>',f2);
            adrtampon=fstream[i];
            while (adrtampon!='\0' &&  adrtampon!=ter)  
            {
                fputc(adrtampon,f2);
                i++;
                adrtampon=fstream[i];
            }
            i--;
        }
    }
    i++;
}

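When I find a '>', I first extract the ID of the sequence from between the two '|' characters. Then I loop through my library of interesting IDs with another simple function (similar to `if id in idlist`). Finally, I call this C function from a Python function that still uses the multiprocessing class.

In the end... I get worse performance with the C code than with the Python code, even with only one process. (When I work directly on the file instead of on chunks I get better performance with C, but only with a single process, because of concurrent access to the file from multiple processes, I think.)

Any suggestions for improving my C code, and an explanation of why it is slower than the Python equivalent? Thanks a lot! (Especially if you have read this far!)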

score 0 · Accepted Answer

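A 10 GB file probably will not fit in memory (if it did, you could do it the way I did here). So the only way to read and process it is: read a part, process that part, read the next part. If the line length is bounded, fgets() is the most elegant option. Otherwise, you can read one character at a time and process the input with a small state machine. Reading buffer-sized blocks is possible but harder, because logical lines will straddle buffer boundaries.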
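The fgets() approach could look roughly like the sketch below. It is only an illustration under stated assumptions, not the asker's actual code: the names `filter_fasta`, `header_id` and `cmp_id` are hypothetical, the ID is assumed to be exactly the second '|'-separated field of the header (the extra prefix stripping from the Python version is left out), and the wanted IDs are assumed to be available as an array sorted with `strcmp` order so that `bsearch` can be used. Lines longer than the buffer are handled by remembering whether the previous fgets() call ended at a newline.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy the second '|'-separated field of a header
 * line into out. Returns 1 on success, 0 if there is no such field or
 * it does not fit. Assumes the field ends inside the current buffer. */
static int header_id(const char *line, char *out, size_t outsz)
{
    const char *start = strchr(line, '|');
    if (start == NULL)
        return 0;
    start++;
    size_t len = strcspn(start, "|\n");
    if (len == 0 || len >= outsz)
        return 0;
    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}

/* Comparison callback for qsort/bsearch over an array of C strings. */
static int cmp_id(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Stream `in` to `out`, keeping only the records whose ID appears in
 * the sorted array `ids`. Returns the number of records kept. */
static long filter_fasta(FILE *in, FILE *out, const char **ids, size_t n_ids)
{
    char line[4096];       /* assumed buffer size, not a hard line limit */
    char id[256];          /* assumed maximum ID length */
    int keep = 0;          /* are we inside a wanted record? */
    int at_line_start = 1; /* did the previous read end with '\n'? */
    long kept = 0;

    while (fgets(line, sizeof line, in) != NULL) {
        if (at_line_start && line[0] == '>') {
            const char *key = id;
            keep = header_id(line, id, sizeof id) &&
                   bsearch(&key, ids, n_ids, sizeof *ids, cmp_id) != NULL;
            if (keep)
                kept++;
        }
        if (keep)
            fputs(line, out);
        /* No '\n' in the buffer means the logical line continues. */
        at_line_start = (strchr(line, '\n') != NULL);
    }
    return kept;
}
```

A caller would sort the ID array once with `qsort(ids, n_ids, sizeof *ids, cmp_id)` and then call `filter_fasta(in, out, ids, n_ids)` on the opened files. Only one line is held in memory at a time, and each kept line is written straight to the output instead of being appended to a growing string.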
