
I have a program whose purpose is to read eight files, each a million characters long, with no punctuation, just a bunch of characters.

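The eight files represent four DNA samples that were found. What the program does is take the characters from one file in a sample and combine them with the characters from the other file of the same sample. For example, if file1 read: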

abcdefg

and file2 read:

hijklmn

The combinations would be:

ah, bi, cj, dk, el, fm, gn
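
As an illustration only, the pairing described above can be sketched in a few lines of Python. The two strings below are stand-ins for the file contents (the variable names are made up for this example), and `zip()` does the position-by-position matching:

```python
# Illustrative sketch: pair characters position by position.
file1_text = "abcdefg"  # stand-in for the contents of file1
file2_text = "hijklmn"  # stand-in for the contents of file2

# zip() walks both strings in step; ''.join() glues each pair together.
pairs = [''.join(pair) for pair in zip(file1_text, file2_text)]
print(pairs)
```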

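Anyway, the program then goes on to count how many times each paired combination occurs, and it prints out a dictionary that would read something like this, for example: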

{'mm': 52, 'CC': 66, 'SS': 24, 'cc': 19, 'MM': 26, 'ss': 58, 'TT': 43, 'tt': 32}

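The problem is that while the program works for small files, for files a million characters long (yes, that's a literal number, not an exaggeration) the program hangs and never seems to finish the task. (I let it run overnight once, and nothing came of it.)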

Is it an overflow error, or is the method I'm using too small for large files? Is there a better way to handle this?

My code:

import re
from collections import Counter

def ListStore(fileName):
    '''Purpose, stores the contents of file into a single string'''           

    #old code left in for now
    '''
    with open(fileName, "r") as fin:
        fileContents = fin.read().rstrip()
        fileContents = re.sub(r'\W', '', fin.read())
    '''
    #opens up the file given to the function
    fin = open(fileName,'r')

    #reads the file into a string, strips out the newlines as well
    fileContents = fin.read().rstrip()


    #closes up the file
    fin.close()

    #splits up the fileContents into a list of characters
    fileContentsList = list(fileContents)   

    #returns the list of characters
    return fileContentsList


def ListCombo(list1, list2):
    '''Purpose: combines the two DNA lists into one'''


    #creates an empty list for list3
    list3 = []

    #combines the codes from one half with their matching from the other
    list3 = [''.join(pair) for pair in zip(list1, list2)]

    return list3


def printResult(list):
    '''stores the result of the combination in a dictionary'''




    #stores the result into a dictionary
    result = dict((i,list.count(i)) for i in list)

    print (result)
    return result


def main():

    '''Purpose: Reads the contents of 8 files, and finds out how many
    combinations exist'''


    #first sample files

    file_name = "a.txt"
    file_name2 = "b.txt"

    #second sample files
    file_name3 = "c.txt"
    file_name4 = "d.txt"

    #third sample files
    file_name5 = "e.txt"
    file_name6 = "f.txt"

    #fourth sample files
    file_name7 = "g.txt"
    file_name8 = "h.txt"


    #Get the first sample ready

    #store both sides into a list of characters

    contentList = ListStore(file_name)

    contentList2 = ListStore(file_name2)

    #combine the two lists together
    combo_list = ListCombo(contentList, contentList2)

    #store the first sample results into a dictionary
    SampleA = printResult(combo_list)

    print (SampleA)

    # ****Get the second sample ready****

    #store both sides into a list of characters
    contentList3 = ListStore(file_name3)
    contentList4 = ListStore(file_name4)

    #combine the two lists together
    combo_list2 = ListCombo(contentList3, contentList4)

    #store the second sample results into a dictionary
    SampleB = printResult(combo_list2)

    print (SampleB)

    # ****Get the third sample ready****

    #store both sides into a list of characters
    contentList5 = ListStore(file_name5)
    contentList6 = ListStore(file_name6)

    #combine the two lists together
    combo_list3 = ListCombo(contentList5, contentList6)

    #store the third sample results into a dictionary
    SampleC = printResult(combo_list3)

    print (SampleC)

    # ****Get the fourth sample ready****

    #store both sides into a list of characters
    contentList7 = ListStore(file_name7)
    contentList8 = ListStore(file_name8)

    #combine the two lists together
    combo_list4 = ListCombo(contentList7, contentList8)

    #store the fourth sample results into a dictionary
    SampleD = printResult(combo_list4)

    print (SampleD)



if __name__ == '__main__':
    main()

3 Answers


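Don't read the whole contents into memory. There is no need. Also, zip() already splits your strings into characters, so you don't need to do that yourself.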

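The trick here is to create a generator that pairs up your characters while reading both files in chunks, which will be the most efficient way of reading the files.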

Finally, use collections.Counter() to keep count:

from functools import partial
from collections import Counter

with open(filename1, 'r') as file1, open(filename2, 'r') as file2:
    chunked1 = iter(partial(file1.read, 1024), '')
    chunked2 = iter(partial(file2.read, 1024), '')
    counts = Counter(''.join(pair) for chunks in zip(chunked1, chunked2) for pair in zip(*chunks))

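The code here reads in chunks of 1024 bytes; adjust as needed for best performance. No more than 2048 bytes of the files are held in memory at a time, and the pairs are generated on the fly while counting.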
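
For anyone who wants to try the chunked approach end to end, here is a minimal sketch that wraps the same idea in a function. The name `count_pairs_chunked`, its `chunk_size` parameter, and the throwaway demo files are assumptions made for this example, not part of the snippet above:

```python
from collections import Counter
from functools import partial
import os
import tempfile

def count_pairs_chunked(path1, path2, chunk_size=1024):
    """Count character pairs from two files, reading chunk_size characters at a time."""
    with open(path1, 'r') as file1, open(path2, 'r') as file2:
        # iter(callable, sentinel) keeps calling read() until it returns ''
        chunked1 = iter(partial(file1.read, chunk_size), '')
        chunked2 = iter(partial(file2.read, chunk_size), '')
        # The Counter consumes the generator while both files are still open.
        return Counter(
            ''.join(pair)
            for chunk1, chunk2 in zip(chunked1, chunked2)
            for pair in zip(chunk1, chunk2)
        )

# Small demonstration with throwaway files (hypothetical contents)
with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, 'a.txt')
    path_b = os.path.join(tmp, 'b.txt')
    with open(path_a, 'w') as f:
        f.write('mCSm')
    with open(path_b, 'w') as f:
        f.write('mCSm')
    demo_counts = count_pairs_chunked(path_a, path_b, chunk_size=2)

print(demo_counts)
```

A tiny `chunk_size` is used in the demo only to force several reads; for real data a larger value is probably the better starting point.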

answered 2013-07-30T13:40:59.437

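As written, I personally don't think your program is I/O bound, and even if it were, breaking the reading up into many calls, even buffered ones, isn't going to be as fast as reading the whole thing into memory the way you're doing. That said, I'm not sure why your program takes so long to process huge files. It may be the many unneeded operations it performs: strings and lists are both sequences, so converting from one to the other is generally not necessary.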

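Here's an optimized version of your program with most of the redundant and/or unnecessary things removed. It actually makes use of the collections.Counter class, which your code imported but never used, and even though it still reads the contents of the files into memory, it only keeps them around for the minimum time needed to process each pair of files.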

from collections import Counter
import os

DATA_FOLDER = 'datafiles' # folder path to data files ('' for current dir)

def ListStore(fileName):
    '''return contents of file as a single string with any newlines removed'''
    with open(os.path.join(DATA_FOLDER, fileName), 'r') as fin:
        return fin.read().replace('\n', '')

def ListCombo(seq1, seq2):
    '''combine the two DNA sequences into one'''
    # combines the codes from one half with their matching from the other
    return [''.join(pair) for pair in zip(seq1, seq2)]

def CountPairs(seq):
    '''counts occurrences of pairs in the list of the combinations and stores
    them in a Counter dict instance keyed by letter-pairs'''
    return Counter(seq)

def PrintPairs(counter):
    #print the results in the counter dictionary (in sorted order)
    print('{' + ', '.join(('{}: {}'.format(pair, count)
        for pair, count in sorted(counter.items()))) + '}')

def ProcessSamples(file_name1, file_name2):
    # store both sides into a list of characters
    contentList1 = ListStore(file_name1)
    contentList2 = ListStore(file_name2)

    # combine the two lists together
    combo_list = ListCombo(contentList1, contentList2)

    # count the sample results and store into a dictionary
    counter = CountPairs(combo_list)

    #print the results
    PrintPairs(counter)

def main():
    '''reads the contents of N pairs of files, and finds out how many
    combinations exist in each'''
    file_names = ('a.txt', 'b.txt',
                  'c.txt', 'd.txt',
                  'e.txt', 'f.txt',
                  'g.txt', 'h.txt',)

    for (file_name1, file_name2) in zip(*([iter(file_names)]*2)):
        ProcessSamples(file_name1, file_name2)

if __name__ == '__main__':
    main()
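
The `zip(*([iter(file_names)]*2))` expression in `main()` can look cryptic, so here is a small sketch of what it does, under the assumption that the file names are listed as consecutive pairs. The shortened tuple and the name `shared` are only for illustration:

```python
file_names = ('a.txt', 'b.txt', 'c.txt', 'd.txt')

# One iterator, referenced twice: zip() pulls from it alternately,
# so each tuple holds two consecutive names.
shared = iter(file_names)
pairs = list(zip(shared, shared))

# The compact spelling used in main() builds the same pairs.
compact = list(zip(*([iter(file_names)] * 2)))
print(pairs)
```

Because both arguments to `zip()` are the same iterator object, every item is consumed exactly once, which is what groups the flat tuple into pairs.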
answered 2013-07-30T16:04:41.400

In your printResult method, you loop through every element i in list and assign the value list.count(i) to the key i in the dictionary result.

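I'm not entirely sure how count(i) works, but I believe it involves searching through most of the list and counting the number of elements equal to i each time it runs. In your code, if you have duplicates, as in ['aa','bb','aa'], you will count how many elements in the list are 'aa' twice, going through the whole list each time. That is very time-consuming on a long list.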
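
To put a rough number on that cost, here is a small sketch. It relies on the fact that list.count() scans the entire list on every call; the function names are hypothetical, and the figures are a back-of-the-envelope estimate rather than a measurement:

```python
def count_quadratic(items):
    """Same approach as the question: one full list.count() scan per element."""
    return dict((item, items.count(item)) for item in items)

def count_single_pass(items):
    """One pass over the list, tallying as it goes."""
    tally = {}
    for item in items:
        tally[item] = tally.get(item, 0) + 1
    return tally

sample = ['aa', 'bb', 'aa']
# Both give the same tallies; only the amount of work differs.
same_result = count_quadratic(sample) == count_single_pass(sample)

# Rough work estimate for a million pairs: n scans of n elements each.
n = 1000000
visits_quadratic = n * n  # element visits with one list.count() per element
visits_single = n         # element visits with a single pass
print(same_result, visits_quadratic // visits_single)
```

With a million pairs, the per-element count() approach visits on the order of a trillion elements, which would be consistent with the program appearing to hang rather than crash.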

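You only need to go through the list once to count how many elements of each type there are. I'd suggest using a defaultdict for this, since you can make each new key start with a default value of 0: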

    from collections import defaultdict
    result = defaultdict(int)
    for i in list:
        result[i] = result[i] + 1
    print(result)

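Creating a defaultdict with int lets each new key start at a value of 0. You can then go through the list once, adding 1 to each pair's value every time you find it. This eliminates going through the list multiple times.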
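
For completeness, a self-contained version of the same idea might look like the sketch below. The function name `count_pairs` and the sample list are made up for this example:

```python
from collections import Counter, defaultdict

def count_pairs(pairs):
    """Tally each pair in a single pass; missing keys start at int() == 0."""
    result = defaultdict(int)
    for pair in pairs:
        result[pair] += 1
    return result

tally = count_pairs(['aa', 'bb', 'aa'])
print(dict(tally))
```

The tallies should match what collections.Counter produces for the same list, so either tool works here; defaultdict just makes the counting step explicit.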

answered 2013-07-30T13:48:15.630