I am trying to write a Python program that parses rows of data meeting certain criteria from an input file into a series of output files.

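The program reads an input file containing the start and stop positions of genes on a chromosome. For each line of this input file, it reads, line by line, a second input file containing the positions of known SNPs on the chromosome of interest. If a SNP lies between the start and stop positions of the gene being iterated over, it is copied to a new file.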

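The problem with my program is that it is inefficient. For every gene analysed, it reads the SNP input file from the first line until it reaches a SNP whose chromosomal position is greater than (i.e. has a higher position number than) the stop position of the gene being iterated over. Since all gene and SNP data are sorted by chromosomal position, the program would be much faster if, for each gene, I could somehow "tell" it to start reading the SNP position file from the last line read in the previous iteration, rather than from the first line of the file.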

Is there any way to do this in Python? Or must every file be read from the first line?

My code so far is below. Any suggestions would be greatly appreciated.

import sys
import fileinput
import shlex
geneCoordinates = open("Gene Coordinates.txt",'r')
geneCoordinates = list(geneCoordinates)
n = (len(geneCoordinates))
nSNPsPerGene=open("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/nSNPs per gene.txt", 'a')

i=0
for i in range(i,n):
    x=i
    L=shlex.shlex(geneCoordinates[x],posix=True)
    L.whitespace += ','
    L.whitespace_split = True
    L=list(L)
    output=open((("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/%s.txt")%(str(L[2]))), 'a')
    geneStart=int(L[2])
    geneStop=int(L[3])
    for line in fileinput.input("SNPs.txt"):
        if not fileinput.isfirstline():
            nSNPs=0
            SNP=shlex.shlex(line,posix=True)
            SNP.whitespace += '\t'
            SNP.whitespace_split = True
            SNP=list(SNP)
            SNPlocation=int(SNP[0])
            if SNPlocation < geneStart:
                continue
            if SNPlocation >= geneStart:
                if SNPlocation <= geneStop:
                    nSNPs=nSNPs+1
                    output.write(str(SNP))
                    output.write("\n")
            else: break
    nSNPsPerGene.write(("%s\t%s")%(str(L[2]),nSNPs))

1 Answer

Just use an iterator (in a scope outside your loop) to keep track of your position in the second file. It should look something like this:

import shlex
geneCoordinates = open("Gene Coordinates.txt",'r')
geneCoordinates = list(geneCoordinates)
n = (len(geneCoordinates))
nSNPsPerGene=open("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/nSNPs per gene.txt", 'a')

i=0

#NEW CODE - 2 lines added.  By opening a file iterator outside of the loop, we can remember our position in it
SNP_file = open("SNPs.txt")
SNP_file.readline() #chomp up the first line, so we don't have to constantly check we're not at the beginning
#end new code.


for i in range(i,n):

   x=i
   L=shlex.shlex(geneCoordinates[x],posix=True)
   L.whitespace += ','
   L.whitespace_split = True
   L=list(L)
   output=open((("C:/Users/gwilymh/Desktop/Python/SNPsPerGene/%s.txt")%(str(L[2]))), 'a')
   geneStart=int(L[2])
   geneStop=int(L[3])

   #NEW CODE - deleted 2 lines, added 4
   nSNPs=0 #reset the counter once per gene, before any SNP line is read
   #loop until break
   while True:
      line = SNP_file.readline()
      if not line: #exit loop if end of file reached
         break
      #end new code - the rest of your loop should behave normally

      SNP=shlex.shlex(line,posix=True)
      SNP.whitespace += '\t'
      SNP.whitespace_split = True
      SNP=list(SNP)
      SNPlocation=int(SNP[0])
      if SNPlocation < geneStart:
          continue
      #NEW CODE - 1 line changed
      else: #if SNPlocation >= geneStart: 
      #logic dictates that if SNPLocation is not < geneStart, then it MUST be >= genestart. so ELSE is sufficient
          if SNPlocation <= geneStop:
              nSNPs=nSNPs+1
              output.write(str(SNP))
              output.write("\n")
              #NEW CODE 1 line added- need to exit this loop once we have found a match.
              #NOTE - your old code would return the LAST match. new code returns the FIRST match.
              #assuming there is only 1 match this won't matter... but I'm not sure if that assumption is true.
              break
      #NEW CODE - 1 line deleted
      #else: break else no longer required. there are only two possible options.

   nSNPsPerGene.write(("%s\t%s")%(str(L[2]),nSNPs))
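
The loop above stops at the first SNP it finds inside a gene. If a gene can contain several SNPs, the first SNP read past geneStop also has to be held over for the next gene rather than thrown away. Below is a minimal sketch of that single-pass idea. It assumes the genes are sorted by start position and do not overlap, that the SNPs are sorted, and that the SNP position is the first tab-separated column after one header line. The names snps_per_gene, read_positions and pending are hypothetical, not part of your code.

```python
def read_positions(lines):
    """Yield the integer in the first tab-separated column of each line.

    Assumes (hypothetically) that the first line is a header.
    """
    lines = iter(lines)
    next(lines, None)  # skip the header line once
    for line in lines:
        if line.strip():  # ignore blank lines
            yield int(line.split("\t")[0])


def snps_per_gene(genes, snp_positions):
    """Yield (start, stop, hits) for each gene in a single pass over the SNPs.

    genes: iterable of (start, stop) pairs, sorted and non-overlapping (assumed).
    snp_positions: iterable of ints in ascending order (assumed).
    """
    snps = iter(snp_positions)   # created once, so it remembers its position
    pending = next(snps, None)   # first SNP not yet assigned to any gene
    for start, stop in genes:
        hits = []
        # Skip SNPs that lie before this gene.
        while pending is not None and pending < start:
            pending = next(snps, None)
        # Collect SNPs inside the gene; the first SNP past `stop`
        # stays in `pending` so the next gene can still see it.
        while pending is not None and pending <= stop:
            hits.append(pending)
            pending = next(snps, None)
        yield start, stop, hits
```

An open file object can be passed straight in, for example snps_per_gene(genes, read_positions(open("SNPs.txt"))), so the SNP file is read only once. If genes can overlap, a SNP in the shared region would be reported only for the first gene, and you would need a different approach.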
answered 2013-02-13T02:10:52.960