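So, I am just trying to count the single nucleotide frequencies (A, T, C, G) in a huge file that contains patterns similar to this: TTTGTATAAGAAAAAATAGG.
This should give me a single line of output for the whole file, for example: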
The single nucleotide frequency matrix of T.volcanium Genome is: {'A': [234235], 'C': [234290], 'G': [32456], 'T': [346875]}
Here is my code (omitting the file path, open, close, and main):
def freq_dict_of_lists_v1(dna_list):
    n = max([len(dna) for dna in dna_list])
    frequency_matrix = {
        'A': [0] * n,
        'C': [0] * n,
        'G': [0] * n,
        'T': [0] * n,
    }
    for dna in dna_list:
        for index, base in enumerate(dna):
            frequency_matrix[base][index] += 1
    return frequency_matrix
for line in file:
    dna_list = file.readline().rstrip("\n")
    frequency_matrix = freq_dict_of_lists_v1(dna_list)
    print("The single nucleotide frequency matrix of T.volcanium Genome is: ")
    pprint.pprint(frequency_matrix)
Here is my output:
The single nucleotide frequency matrix of T.volcanium Genome is:
{'A': [21], 'C': [10], 'G': [11], 'T': [18]}
The single nucleotide frequency matrix of T.volcanium Genome is:
{'A': [31], 'C': [6], 'G': [4], 'T': [19]}
The single nucleotide frequency matrix of T.volcanium Genome is:
{'A': [23], 'C': [9], 'G': [10], 'T': [18]}
The single nucleotide frequency matrix of T.volcanium Genome is:
{'A': [17], 'C': [8], 'G': [9], 'T': [26]}
The single nucleotide frequency matrix of T.volcanium Genome is:
{'A': [15], 'C': [13], 'G': [9], 'T': [23]}
The single nucleotide frequency matrix of T.volcanium Genome is:
{'A': [21], 'C': [12], 'G': [10], 'T': [17]}
The single nucleotide frequency matrix of T.volcanium Genome is:
{'A': [20], 'C': [9], 'G': [12], 'T': [19]}
The single nucleotide frequency matrix of T.volcanium Genome is:
{'A': [15], 'C': [15], 'G': [10], 'T': [20]}
The single nucleotide frequency matrix of T.volcanium Genome is:
{'A': [20], 'C': [11], 'G': [10], 'T': [19]}
The single nucleotide frequency matrix of T.volcanium Genome is:
{'A': [26], 'C': [13], 'G': [7], 'T': [14]}
The single nucleotide frequency matrix of T.volcanium Genome is:
{'A': [12], 'C': [13], 'G': [13], 'T': [22]}
The single nucleotide frequency matrix of T.volcanium Genome is:
{'A': [20], 'C': [16], 'G': [9], 'T': [15]}
The single nucleotide frequency matrix of T.volcanium Genome is:
{'A': [22], 'C': [12], 'G': [6], 'T': [20]}
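So it counts per line. I tried removing the for loop, or removing readlines, but then it only gives me one line of output, for a single line of the file, not for the whole file.
I feel like I am overthinking this. I am sure there is a simple way to read the whole file and print a single line of output with the total frequencies... Any insight is appreciated.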
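
For reference, one possible way to get a single total for the whole file is to keep one running tally and print it only after the loop has finished. The code above instead builds a fresh matrix and prints it for every line. The following is only a minimal sketch under some assumptions: the helper name count_nucleotides is hypothetical, the file is assumed to hold plain sequence lines, and an in-memory io.StringIO stands in for the real file object.

```python
from collections import Counter
import io


def count_nucleotides(lines):
    """Return total A/C/G/T counts over an iterable of text lines."""
    totals = Counter()
    for line in lines:
        # Counter.update on a string adds one count per character.
        totals.update(line.strip().upper())
    # Report only the four bases, in a fixed order; other symbols are ignored.
    return {base: totals[base] for base in "ACGT"}


# Hypothetical stand-in for the real file; an opened file object
# can be passed to count_nucleotides in the same way.
sample = io.StringIO("TTTGTATAAGAAAAAATAGG\nACGT\n")
counts = count_nucleotides(sample)

# Printed once, after every line has been counted.
print("The single nucleotide frequency matrix of T.volcanium Genome is:", counts)
```

Because the tally is updated line by line, the whole file never has to be held in memory, which may matter for a genome-sized input.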