python - Python 3: Is the method I'm using to count combination results too slow?

Question

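I have a program that reads eight files. Each file is one million characters long, with no punctuation, just a run of characters.

The eight files represent four DNA samples that were found. The program takes the characters from one file of a sample and combines them with the characters at the same positions in the other file of that sample. For example, if file1 reads: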

abcdefg

and file2 reads:

hijklmn

the combinations would be:

ah, bi, cj, dk, el, fm, gn

The program then counts how many times each pair occurs and prints a dictionary that reads something like this:

{'mm': 52, 'CC': 66, 'SS': 24, 'cc': 19, 'MM': 26, 'ss': 58, 'TT': 43, 'tt': 32}

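The problem is that while the program works for small files, for files a million characters long (yes, that is a literal number, not an exaggeration) it hangs and never seems to finish. (I let it run overnight once and got no result.)

Is this an overflow error, or is the method I'm using too slow for large files? Is there a better way to handle this?

My code: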

import re
from collections import Counter

def ListStore(fileName):
    '''Purpose, stores the contents of file into a single string'''           

    #old code left in for now
    '''
    with open(fileName, "r") as fin:
        fileContents = fin.read().rstrip()
        fileContents = re.sub(r'\W', '', fin.read())
    '''
    #opens up the file given to the function
    fin = open(fileName,'r')

    #reads the file into a string, strips out the newlines as well
    fileContents = fin.read().rstrip()


    #closes up the file
    fin.close()

    #splits up the fileContents into a list of characters
    fileContentsList = list(fileContents)   

    #returns the string
    return fileContentsList


def ListCombo(list1, list2):
    '''Purpose: combines the two DNA lists into one'''


    #creates an empty list for list3
    list3 = []

    #combines the codes from one half with their matching from the other
    list3 = [''.join(pair) for pair in zip(list1, list2)]

    return list3


def printResult(list):
    '''stores the result of the combination in a dictionary'''




    #stores the result into a dictionary
    result = dict((i,list.count(i)) for i in list)

    print (result)
    return result


def main():

    '''Purpose: Reads the contents of 8 files, and finds out how many
    combinations exist'''


    #first sample files

    file_name = "a.txt"
    file_name2 = "b.txt"

    #second sample files
    file_name3 = "c.txt"
    file_name4 = "d.txt"

    #third sample files
    file_name5 = "e.txt"
    file_name6 = "f.txt"

    #fourth sample files
    file_name7 = "g.txt"
    file_name8 = "h.txt"


    #Get the first sample ready

    #store both sides into a list of characters

    contentList = ListStore(file_name)

    contentList2 = ListStore(file_name2)

    #combine the two lists together
    combo_list = ListCombo(contentList, contentList2)

    #store the first sample results into a dictionary
    SampleA = printResult(combo_list)

    print (SampleA)

    # ****Get the second sample ready****

    #store both sides into a list of characters
    contentList3 = ListStore(file_name3)
    contentList4 = ListStore(file_name4)

    #combine the two lists together
    combo_list2 = ListCombo(contentList3, contentList4)

    #store the second sample results into a dictionary
    SampleB = printResult(combo_list2)

    print (SampleB)

    # ****Get the third sample ready****

    #store both sides into a list of characters
    contentList5 = ListStore(file_name5)
    contentList6 = ListStore(file_name6)

    #combine the two lists together
    combo_list3 = ListCombo(contentList5, contentList6)

    #store the third sample results into a dictionary
    SampleC = printResult(combo_list3)

    print (SampleC)

    # ****Get the fourth sample ready****

    #store both sides into a list of characters
    contentList7 = ListStore(file_name7)
    contentList8 = ListStore(file_name8)

    #combine the two lists together
    combo_list4 = ListCombo(contentList7, contentList8)

    #store the fourth sample results into a dictionary
    SampleD = printResult(combo_list4)

    print (SampleD)



if __name__ == '__main__':
    main()

score 2 · Accepted Answer

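Don't read the whole contents into memory. There is no need. Also, zip() already splits your strings into characters, so you don't need to do that yourself.

The trick here is to create a generator that pairs up your characters while reading the two files in chunks, which is the most efficient way to read the files.

Finally, use collections.Counter() to keep count: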

from functools import partial
from collections import Counter

with open(filename1, 'r') as file1, open(filename2, 'r') as file2:
    chunked1 = iter(partial(file1.read, 1024), '')
    chunked2 = iter(partial(file2.read, 1024), '')
    counts = Counter(''.join(pair) for chunks in zip(chunked1, chunked2) for pair in zip(*chunks))

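The code here reads in chunks of 1024 characters (1024 bytes for plain ASCII data); adjust as needed for best performance. No more than 2048 characters of file data are held in memory at a time, and the pairs are generated on the fly while counting.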
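As a minimal sketch, the same chunked approach can be wrapped in a function so it can be called once per sample. The function name count_pairs_chunked and the chunk_size parameter are assumptions made for illustration, not part of the original answer. The sketch assumes both files have the same length and contain no newlines; a trailing newline present in both files would be counted as a pair of its own.

```python
from collections import Counter
from functools import partial


def count_pairs_chunked(path1, path2, chunk_size=1024):
    """Count aligned character pairs from two text files, reading in chunks.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    with open(path1, 'r') as file1, open(path2, 'r') as file2:
        # iter(callable, sentinel) keeps calling read(chunk_size) until it
        # returns the empty string, which signals end of file.
        chunked1 = iter(partial(file1.read, chunk_size), '')
        chunked2 = iter(partial(file2.read, chunk_size), '')
        # The outer zip pairs up chunks, the inner zip pairs up characters.
        return Counter(
            a + b
            for chunk1, chunk2 in zip(chunked1, chunked2)
            for a, b in zip(chunk1, chunk2)
        )
```

Calling count_pairs_chunked('a.txt', 'b.txt') would then return the Counter for the first sample; the Counter is built incrementally, so only the current pair of chunks and the running tallies are in memory.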

score 1 · Accepted Answer

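As written, I personally don't think your program is I/O bound, and even if it were, breaking the reads up into many calls, even buffered ones, would not be as fast as reading the whole thing into memory as you are doing. That said, I'm not sure why your program takes so long to process huge files. It may be the many unneeded operations it performs: since strings and lists are both sequences, there is usually no need to convert from one to the other.

Here is an optimized version of your program with most of the redundant and/or unnecessary parts removed. It actually makes use of the collections.Counter class, which your code imported but never used, and although it still reads the files' contents into memory, it keeps them only for the minimum time needed to process each pair of files.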
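To illustrate the point about sequences, here is a small sketch using the example strings from the question. It shows that zip() pairs up the characters of two strings directly, so the list() conversion in ListStore does not change the result.

```python
seq1 = "abcdefg"
seq2 = "hijklmn"

# zip() iterates over the strings character by character; no list() call needed.
pairs = [a + b for a, b in zip(seq1, seq2)]
print(pairs)  # ['ah', 'bi', 'cj', 'dk', 'el', 'fm', 'gn']

# Converting to lists first produces exactly the same pairs.
pairs_from_lists = [a + b for a, b in zip(list(seq1), list(seq2))]
print(pairs == pairs_from_lists)  # True
```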

from collections import Counter
import os

DATA_FOLDER = 'datafiles' # folder path to data files ('' for current dir)

def ListStore(fileName):
    '''return contents of file as a single string with any newlines removed'''
    with open(os.path.join(DATA_FOLDER, fileName), 'r') as fin:
        return fin.read().replace('\n', '')

def ListCombo(seq1, seq2):
    '''combine the two DNA sequences into one'''
    # combines the codes from one half with their matching from the other
    return [''.join(pair) for pair in zip(seq1, seq2)]

def CountPairs(seq):
    '''counts occurrences of pairs in the list of the combinations and stores
    them in a Counter dict instance keyed by letter-pairs'''
    return Counter(seq)

def PrintPairs(counter):
    #print the results in the counter dictionary (in sorted order)
    print('{' + ', '.join(('{}: {}'.format(pair, count)
        for pair, count in sorted(counter.items()))) + '}')

def ProcessSamples(file_name1, file_name2):
    # store both sides into a list of characters
    contentList1 = ListStore(file_name1)
    contentList2 = ListStore(file_name2)

    # combine the two lists together
    combo_list = ListCombo(contentList1, contentList2)

    # count the sample results and store into a dictionary
    counter = CountPairs(combo_list)

    #print the results
    PrintPairs(counter)

def main():
    '''reads the contents of N pairs of files, and finds out how many
    combinations exist in each'''
    file_names = ('a.txt', 'b.txt',
                  'c.txt', 'd.txt',
                  'e.txt', 'f.txt',
                  'g.txt', 'h.txt',)

    for (file_name1, file_name2) in zip(*([iter(file_names)]*2)):
        ProcessSamples(file_name1, file_name2)

if __name__ == '__main__':
    main()
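
The expression zip(*([iter(file_names)]*2)) in main() may look cryptic. The short sketch below shows what it produces, alongside a slice-based spelling that gives the same pairs for a tuple; the variable names pairs and sliced are only for illustration.

```python
file_names = ('a.txt', 'b.txt', 'c.txt', 'd.txt',
              'e.txt', 'f.txt', 'g.txt', 'h.txt')

# [iter(file_names)] * 2 is a list holding the SAME iterator twice, so zip()
# pulls two consecutive items from it for every tuple it builds.
pairs = list(zip(*([iter(file_names)] * 2)))
print(pairs)
# [('a.txt', 'b.txt'), ('c.txt', 'd.txt'), ('e.txt', 'f.txt'), ('g.txt', 'h.txt')]

# An equivalent spelling for sequences: even-indexed items paired with odd-indexed ones.
sliced = list(zip(file_names[::2], file_names[1::2]))
print(pairs == sliced)  # True
```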

score 1 · Accepted Answer

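In your printResult method, you iterate over each element i in list and assign the value list.count(i) to the key i in the dictionary result.

I'm not entirely sure how count(i) works, but I believe it involves searching most of the list and counting the number of elements equal to i each time it runs. In your code, if you have duplicates, e.g. in ['aa','bb','aa'], you will count how many elements in the list are 'aa' twice, going through the entire list each time. This is very time consuming on long lists.

You only need to go through the list once to count the number of elements of each type. I suggest using a defaultdict for this, since you can make every new key start with a default value of 0.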
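
The cost described above can be made visible with a small sketch. list.count() scans the whole list on every call, and the expression in printResult calls it once per element, so a list of one million pairs needs on the order of 10^12 comparisons. The CountingList subclass below is a hypothetical helper that exists only to tally the calls.

```python
from collections import Counter


class CountingList(list):
    """A list that records how many times count() is called (illustration only)."""
    calls = 0

    def count(self, value):
        CountingList.calls += 1
        return super().count(value)


pairs = CountingList(['aa', 'bb', 'aa', 'cc', 'aa', 'bb'])

# Same expression as printResult in the question: one count() call per element,
# and every call scans the entire list.
slow = dict((i, pairs.count(i)) for i in pairs)

# Counter tallies everything in a single pass over the list.
fast = Counter(pairs)

print(CountingList.calls)  # 6: one full scan per element, duplicates included
print(slow == dict(fast))  # True
```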

    from collections import defaultdict
    result = defaultdict(int)
    for i in list:
        result[i] = result[i] + 1
    print(result)

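Creating a defaultdict with int makes every new key start at the value 0. You can then go through the list once, adding 1 to each pair's value every time you find it. This eliminates going through the list multiple times.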
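
A self-contained version of this single-pass idea might look like the sketch below. The function name count_pairs and the parameter name pairs are assumptions chosen for illustration (the latter also avoids shadowing the built-in list). The last line checks that the tallies match what collections.Counter produces for the same input.

```python
from collections import Counter, defaultdict


def count_pairs(pairs):
    """Tally each pair in a single pass (hypothetical helper name)."""
    result = defaultdict(int)
    for pair in pairs:
        result[pair] += 1  # a missing key starts at int() == 0
    return result


combo = ['aa', 'bb', 'aa']
result = count_pairs(combo)
print(dict(result))  # {'aa': 2, 'bb': 1}
print(dict(result) == dict(Counter(combo)))  # True
```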
