python - Is there a way to simplify this code?

Question

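I'm doing some bioinformatics research and I'm new to Python. I wrote this code to interpret a file containing protein sequences. The file "bulk_sequences.txt" itself contains 71,423 lines of information. Every three lines represent one protein sequence, and the first of the three gives information including the year the protein was discovered (that is what "/1945" is all about). With a smaller sample of 1,000 lines it works fine, but with this large file it seems to take a very long time. Is there anything I can do to simplify this?

It is meant to go through the file, sort it by year of discovery, and then assign all three lines of protein sequence data to one item in the array "sortedsqncs".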

    import time
    start = time.time()



    file = open("bulk_sequences.txt", "r")
    fileread = file.read()
    bulksqncs = fileread.split("\n")
    year = 1933
    newarray = []
    years = []
    thirties = ["/1933","/1934","/1935","/1936","/1937","/1938","/1939","/1940","/1941","/1942"]## years[0]
    forties = ["/1943","/1944","/1945","/1946","/1947","/1948","/1949","/1950","/1951","/1952"]## years[1]
    fifties = ["/1953","/1954","/1955","/1956","/1957","/1958","/1959","/1960","/1961","/1962"]## years[2]
    sixties = ["/1963","/1964","/1965","/1966","/1967","/1968","/1969","/1970","/1971","/1972"]## years[3]
    seventies = ["/1973","/1974","/1975","/1976","/1977","/1978","/1979","/1980","/1981","/1982"]## years[4]
    eighties = ["/1983","/1984","/1985","/1986","/1987","/1988","/1989","/1990","/1991","/1992"]## years[5]
    nineties = ["/1993","/1994","/1995","/1996","/1997","/1998","/1999","/2000","/2001","/2002"]## years[6]
    twothsnds = ["/2003","/2004","/2005","/2006","/2007","/2008","/2009","/2010","/2011","/2012"]## years[7]

    years = [thirties,forties,fifties,sixties,seventies,eighties,nineties,twothsnds]
    count = 0
    sortedsqncs = []


    for x in range(len(years)):
        for i in range(len(years[x])):
                for y in bulksqncs:
                        if years[x][i] in y:
                            for n in range(len(bulksqncs)):
                                if y in bulksqncs[n]:
                                    sortedsqncs.append(bulksqncs[n:n+3])
                                    count +=1
    print len(sortedsqncs)

    end = time.time()
    print round((end - start),4)

score 5 · Accepted Answer

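tcaswell's itertools.izip_longest() solution is very elegant, but if you don't use the higher-level iteration tools often, you may forget how it works, and your code may become hard to understand in the future.

But tcaswell is fundamentally right: you loop over the file far too many times. Another inefficiency, at least from a readability and maintainability standpoint, is the predefined arrays of years. Also, you should almost never use range(len(seq)); there is almost always a better (more Pythonic) way. Finally, use readlines() if you want a list of the lines in the file.

A simpler solution is:

1. Write a function extract_year(), as tcaswell suggests, that returns the year from an input line (an element of bulksqncs), or None if no year is found. You can use a regular expression, or, if you know where the year sits in the line, use that position.
2. Loop over the input and extract all the sequences, storing each one as a tuple (year, three sequence lines) and appending the tuple to a list. This also allows the input file to have non-sequence lines interspersed among the sequences.
3. Sort the list of tuples by year.
4. Extract the sequences from the sorted list of tuples.

Sample code - this will give you a Python list of the sorted sequences: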

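# infile is the already opened input file; extract_year() is the helper described above
bulksqncs = infile.readlines()
sq_tuple = []
for idx, line in enumerate(bulksqncs):
    year = extract_year(line)
    if year is not None:
        sq_tuple.append((year, bulksqncs[idx:idx+3]))
sq_tuple.sort()
# readlines() keeps each line's trailing newline, so join without a separator
sortedsqncs = [''.join(item[1]) for item in sq_tuple]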
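
The sample above relies on an extract_year() helper that is not defined in this answer. A minimal sketch, assuming the year always appears in the header line as a slash followed by four digits (as in "/1945"); the regular expression and the example lines are assumptions, not taken from the real file:

```python
import re

# Assumption: the year is written as "/YYYY" somewhere in the header line.
YEAR_PATTERN = re.compile(r"/(\d{4})")

def extract_year(line):
    """Return the year found in `line` as an int, or None if there is none."""
    match = YEAR_PATTERN.search(line)
    return int(match.group(1)) if match else None

# Hypothetical lines, for illustration only.
print(extract_year(">protein A /1945"))  # header line with a year
print(extract_year("MKVLAAGIVGLLLA"))    # sequence line without a year
```

Returning an int rather than the matched string keeps the later sort numeric, and returning None for lines without a year lets the caller skip non-header lines.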

score 4 · Accepted Answer

The problem is that you loop over your giant file many times. You can do it in one pass:

from itertools import izip_longest

#http://docs.python.org/2/library/itertools.html
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

# fold your list into a list of length 3 tuples
data = [n for n in grouper(bulksqncs, 3)]
# sort the list
# tuples will 'do the right thing' by default if the line starts with the year
data.sort()

If your year line does not start with the year, you need to pass the key kwarg to sort:

data.sort(key=lambda x: extract_year(x[0]))
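
For readers unfamiliar with the grouper() recipe, the following sketch runs it on a few made-up lines together with a key-based sort. It targets Python 3 (where izip_longest is named zip_longest) with a fallback for Python 2. The sample lines and the year_of() helper are assumptions for illustration only.

```python
try:
    from itertools import zip_longest                  # Python 3
except ImportError:
    from itertools import izip_longest as zip_longest  # Python 2

def grouper(iterable, n, fillvalue=None):
    """Collect items into fixed-length tuples, padding the last one."""
    # The same iterator object is repeated n times, so each output tuple
    # pulls n consecutive items from it.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# Hypothetical records: one header line followed by two sequence lines.
lines = [
    "seqB /1950", "AAAA", "CCCC",
    "seqA /1945", "GGGG", "TTTT",
]

def year_of(record):
    # Assumption: the header line ends with "/YYYY".
    return int(record[0].rsplit("/", 1)[1])

records = sorted(grouper(lines, 3), key=year_of)
print(records[0])
```

Sorting with an explicit key avoids depending on where the year happens to sit in the header line.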

score 3 · Accepted Answer

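The problem is that every time you find a year in a line, you loop over the file again (for n in range(len(bulksqncs))), so in total there are about 136 billion (= 71423 * (71423 / 3) * 80) iterations. You can reduce that to under 6 million (71423 * 80), which will still take some time but should be manageable.

A simple fix to the main loop is to use enumerate to get the line number, rather than walking the whole file again from the start: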

for decade in decades:
    for year in decade:
        for n, line in enumerate(bulksqncs):
            if year in line:
                sortedsqncs.append(bulksqncs[n:n + 3])
                count += 1

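However, the time can be reduced further by putting the year loop inside the loop that reads lines from the file. I would consider using a dictionary and reading the file one line at a time (rather than reading the whole thing at once with read()). When you find a year in a line, you can use next to grab the following two lines along with the line you are currently on. The program then breaks out of the year loop, avoiding unnecessary iterations (assuming that a single line cannot contain more than one year).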

years = ['/' + str(y) for y in range(1933, 2013)]
sequences = dict((year, []) for year in years)

with open("bulk_sequences.txt", "r") as bulk_sequences:
    for line in bulk_sequences:
        for year in years:
            if year in line:
                # next() pulls the two lines that follow the header line
                sequences[year].append((line,
                                        next(bulk_sequences),
                                        next(bulk_sequences)))
                break

The sorted list can then be obtained as

[sequences[year] for year in years]

Or use an OrderedDict to keep the sequences in order.
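
One way the OrderedDict variant might look, as a sketch only: an in-memory io.StringIO with invented contents stands in for the real file, and the final flattening step is an addition that turns the per-year lists into a single list.

```python
import io
from collections import OrderedDict

# Stand-in for the real file; the contents are invented for illustration.
sample = io.StringIO(
    u"seqB /1950\nAAAA\nCCCC\n"
    u"seqA /1945\nGGGG\nTTTT\n"
)

years = ["/" + str(y) for y in range(1933, 2013)]
# Keys are inserted in chronological order, and OrderedDict preserves it.
sequences = OrderedDict((year, []) for year in years)

for line in sample:
    for year in years:
        if year in line:
            # Take the header line plus the two lines that follow it.
            sequences[year].append((line, next(sample), next(sample)))
            break

# Flatten the per-year lists into one chronologically ordered list.
sortedsqncs = [record for records in sequences.values() for record in records]
print(len(sortedsqncs))
```

Because the keys were inserted in year order, iterating over sequences.values() yields the records chronologically without a separate sort.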
