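I have a program whose purpose is to read eight files. Each file is one million characters long, with no punctuation, just a run of characters.
The eight files represent four DNA samples that were found. The program takes the characters from one file of a sample and combines them with the characters from the other file of the same sample. For example, if file1 reads: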
abcdefg
and file2 reads:
hijklmn
the combinations would be:
ah, bi, cj, dk, el, fm, gn
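The pairing step on its own can be sketched as follows. This is only an illustration of the rule described above; the function name `pair_characters` is hypothetical and is not part of the program shown later.

```python
def pair_characters(first, second):
    """Join the characters of two strings position by position."""
    # zip walks both strings in lockstep and stops at the shorter one.
    return [a + b for a, b in zip(first, second)]


pairs = pair_characters("abcdefg", "hijklmn")
print(pairs)  # ['ah', 'bi', 'cj', 'dk', 'el', 'fm', 'gn']
```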
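Anyway, the program then goes on to count how many times each pair combination occurs, and it prints out a dictionary that reads something like this, for example: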
{'mm': 52, 'CC': 66, 'SS': 24, 'cc': 19, 'MM': 26, 'ss': 58, 'TT': 43, 'tt': 32}
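A rough sketch of the counting step, assuming the pairs are tallied in a single pass with `collections.Counter` from the standard library. The name `count_pairs` is hypothetical, and this is one possible approach rather than the code from the program below.

```python
from collections import Counter


def count_pairs(first, second):
    """Count how often each two-character pair occurs across two strings."""
    # Each pair is built once and counted once, so the inputs are walked a single time.
    return Counter(a + b for a, b in zip(first, second))


counts = count_pairs("mmCCss", "mmCCss")
print(dict(counts))
```

`Counter` is a `dict` subclass, so the result can be indexed like a dictionary or converted with `dict(...)` before printing.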
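The problem is that, while the program works fine for small files, for files that are a million characters long (yes, that is a literal number, not an exaggeration) the program hangs and never seems to finish. (I let it run overnight once, and nothing came of it.)
Is it an overflow error, or is the method I'm using not suited to files this large? Is there a better way to handle this?
My code: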
import re
from collections import Counter
def ListStore(fileName):
    '''Purpose: stores the contents of the file as a list of characters'''
    #old code left in for now
    '''
    with open(fileName, "r") as fin:
        fileContents = fin.read().rstrip()
        fileContents = re.sub(r'\W', '', fin.read())
    '''
    #opens up the file given to the function
    fin = open(fileName,'r')
    #reads the file into a string, strips trailing whitespace (including the final newline)
    fileContents = fin.read().rstrip()
    #closes up the file
    fin.close()
    #splits up the fileContents into a list of characters
    fileContentsList = list(fileContents)
    #returns the list of characters
    return fileContentsList
def ListCombo(list1, list2):
    '''Purpose: combines the two DNA lists into one'''
    #creates an empty list for list3
    list3 = []
    #combines the codes from one half with their matching codes from the other
    list3 = [''.join(pair) for pair in zip(list1, list2)]
    return list3
def printResult(list):
    '''stores the result of the combination in a dictionary'''
    #stores the result into a dictionary
    result = dict((i,list.count(i)) for i in list)
    print (result)
    return result
def main():
    '''Purpose: Reads the contents of 8 files, and finds out how many
    combinations exist'''
    #first sample files
    file_name = "a.txt"
    file_name2 = "b.txt"
    #second sample files
    file_name3 = "c.txt"
    file_name4 = "d.txt"
    #third sample files
    file_name5 = "e.txt"
    file_name6 = "f.txt"
    #fourth sample files
    file_name7 = "g.txt"
    file_name8 = "h.txt"
    # ****Get the first sample ready****
    #store both sides into a list of characters
    contentList = ListStore(file_name)
    contentList2 = ListStore(file_name2)
    #combine the two lists together
    combo_list = ListCombo(contentList, contentList2)
    #store the first sample results into a dictionary
    SampleA = printResult(combo_list)
    print (SampleA)
    # ****Get the second sample ready****
    #store both sides into a list of characters
    contentList3 = ListStore(file_name3)
    contentList4 = ListStore(file_name4)
    #combine the two lists together
    combo_list2 = ListCombo(contentList3, contentList4)
    #store the second sample results into a dictionary
    SampleB = printResult(combo_list2)
    print (SampleB)
    # ****Get the third sample ready****
    #store both sides into a list of characters
    contentList5 = ListStore(file_name5)
    contentList6 = ListStore(file_name6)
    #combine the two lists together
    combo_list3 = ListCombo(contentList5, contentList6)
    #store the third sample results into a dictionary
    SampleC = printResult(combo_list3)
    print (SampleC)
    # ****Get the fourth sample ready****
    #store both sides into a list of characters
    contentList7 = ListStore(file_name7)
    contentList8 = ListStore(file_name8)
    #combine the two lists together
    combo_list4 = ListCombo(contentList7, contentList8)
    #store the fourth sample results into a dictionary
    SampleD = printResult(combo_list4)
    print (SampleD)
if __name__ == '__main__':
    main()
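For comparison, here is a hedged sketch of how the four samples could be handled in a loop with one counting pass per sample. The names `SAMPLE_FILES`, `count_sample` and `count_all_samples` are hypothetical, and the sketch assumes each file holds a single run of characters with only trailing whitespace to strip, as `ListStore` does above.

```python
from collections import Counter

# Hypothetical grouping of the eight file names from main(), one tuple per sample.
SAMPLE_FILES = [
    ("a.txt", "b.txt"),
    ("c.txt", "d.txt"),
    ("e.txt", "f.txt"),
    ("g.txt", "h.txt"),
]


def count_sample(path_a, path_b):
    """Count the character pairs formed by the two files of one sample."""
    with open(path_a, "r") as fin_a, open(path_b, "r") as fin_b:
        half_a = fin_a.read().rstrip()
        half_b = fin_b.read().rstrip()
    # zip stops at the shorter input, the same behaviour as ListCombo.
    return Counter(a + b for a, b in zip(half_a, half_b))


def count_all_samples(sample_files=SAMPLE_FILES):
    """Return one Counter per sample, in the order the pairs are listed."""
    return [count_sample(path_a, path_b) for path_a, path_b in sample_files]
```

Each call to `count_sample` walks the two strings once, so the work grows linearly with the file length. Calling `list.count` for every element, by contrast, rescans the whole list each time.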