python - How do I calculate a frequency matrix from a whole genome file?

Question

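I just want to compute the single nucleotide frequencies (A, T, C, G) in a huge file containing patterns like this one: TTTGTATAAGAAAAAATAGG.

That should give me a single line of output for the whole file, for example: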

The single nucleotide frequency matrix of T.volcanium Genome is: {'A': [234235], 'C': [234290], 'G': [32456], 'T': [346875]}

Here is my code (without the file path, open, close, and main):

def freq_dict_of_lists_v1(dna_list):
    n = max([len(dna) for dna in dna_list])
    frequency_matrix = {
        'A': [0] * n,
        'C': [0] * n,
        'G': [0] * n,
        'T': [0] * n,
    }
    for dna in dna_list:
        for index, base in enumerate(dna):
            frequency_matrix[base][index] += 1

    return frequency_matrix

for line in file:
    dna_list = file.readline().rstrip("\n")
    frequency_matrix = freq_dict_of_lists_v1(dna_list)
    print("The single nucleotide frequency matrix of T.volcanium Genome is: ")
    pprint.pprint(frequency_matrix)

This is my output:

The single nucleotide frequency matrix of T.volcanium Genome is: 
{'A': [21], 'C': [10], 'G': [11], 'T': [18]}
The single nucleotide frequency matrix of T.volcanium Genome is: 
{'A': [31], 'C': [6], 'G': [4], 'T': [19]}
The single nucleotide frequency matrix of T.volcanium Genome is: 
{'A': [23], 'C': [9], 'G': [10], 'T': [18]}
The single nucleotide frequency matrix of T.volcanium Genome is: 
{'A': [17], 'C': [8], 'G': [9], 'T': [26]}
The single nucleotide frequency matrix of T.volcanium Genome is: 
{'A': [15], 'C': [13], 'G': [9], 'T': [23]}
The single nucleotide frequency matrix of T.volcanium Genome is: 
{'A': [21], 'C': [12], 'G': [10], 'T': [17]}
The single nucleotide frequency matrix of T.volcanium Genome is: 
{'A': [20], 'C': [9], 'G': [12], 'T': [19]}
The single nucleotide frequency matrix of T.volcanium Genome is: 
{'A': [15], 'C': [15], 'G': [10], 'T': [20]}
The single nucleotide frequency matrix of T.volcanium Genome is: 
{'A': [20], 'C': [11], 'G': [10], 'T': [19]}
The single nucleotide frequency matrix of T.volcanium Genome is: 
{'A': [26], 'C': [13], 'G': [7], 'T': [14]}
The single nucleotide frequency matrix of T.volcanium Genome is: 
{'A': [12], 'C': [13], 'G': [13], 'T': [22]}
The single nucleotide frequency matrix of T.volcanium Genome is: 
{'A': [20], 'C': [16], 'G': [9], 'T': [15]}
The single nucleotide frequency matrix of T.volcanium Genome is: 
{'A': [22], 'C': [12], 'G': [6], 'T': [20]}

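So it computes the frequencies per line. I tried removing the for loop, and I tried removing readline, but then I only get one line of output covering a single line of the file, not the whole file.

I feel like I'm overthinking this. I'm sure there is a simple way to read the whole file and print a single line of output with the total frequencies... Any insight is appreciated.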

score 0 · Accepted Answer

I'm not sure what HUGE means here (MB? GB?), but this is the simplest solution. Note, however, that it loads the entire file into memory.

# open file with sequence
with open(path_to_file) as f:
    seq = f.read()

# count element A in sequence
seq.count('A')
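
If the file is too large to hold in memory, the same idea can be applied line by line. The following is a minimal sketch, not part of the original answer: the function name `count_bases_streaming` and its `alphabet` parameter are hypothetical, and it assumes a plain text file with one uppercase sequence fragment per line.

```python
from collections import Counter


def count_bases_streaming(path, alphabet="ACGT"):
    """Count nucleotides one line at a time instead of reading the whole file."""
    counts = Counter()
    with open(path) as handle:
        for line in handle:
            # strip() drops the trailing newline so it is not counted
            counts.update(line.strip())
    # Report only the bases in `alphabet`; a Counter returns 0 for missing keys
    return {base: counts[base] for base in alphabet}
```

Calling `count_bases_streaming(path_to_file)` returns one dictionary with the totals for the whole file, so it can be printed once after the file has been read.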

score 0 · Accepted Answer

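I see two problems with your solution.

You are tracking bases per position, whereas your question says you want to track the counts across all lines.
You are calling the function once per line.

My edits below should fix both. See the comments for explanations.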

import pprint

def freq_dict_of_lists_v1(dna_list):
    frequency_matrix = {    # We are only keeping one counter per base
        'A': [0],           # so that we calculate counts across all lines
        'C': [0],
        'G': [0],
        'T': [0],
    }
    for dna in dna_list:
        for base in dna:   # No longer need index, so I removed enumerate
            frequency_matrix[base][0] += 1   # Each value is a one-element list, so update index 0

    return frequency_matrix

# Unlike before, we are now appending all the lines into dna_list.
# `file` is the already opened file object from the question.
dna_list = []
for line in file:
    # Use the loop variable; calling file.readline() here would skip every other line
    dna_list.append(line.rstrip("\n"))

# Calling freq_dict_of_lists_v1 on ALL the lines at once (it is now out of loop)
frequency_matrix = freq_dict_of_lists_v1(dna_list)
print("The single nucleotide frequency matrix of T.volcanium Genome is: ")
pprint.pprint(frequency_matrix)

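One caveat with this solution: make sure all the bases in the file are uppercase. Also make sure there are no non-ACGT characters (some sequences contain special gap characters and the like). If other characters are present, you can refer to this thread; your default entry might be something like Gap.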
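
One way to handle both caveats is to normalize case and route anything unexpected into a catch-all entry. This is only a sketch under assumptions: the function name `count_bases_tolerant` and the `other_key` parameter are hypothetical, and lumping every non-ACGT character under a single "Gap" entry is one possible reading of the suggestion above, not necessarily what the linked thread does.

```python
def count_bases_tolerant(lines, alphabet="ACGT", other_key="Gap"):
    """Count bases case-insensitively; anything outside `alphabet` goes to `other_key`."""
    counts = {base: 0 for base in alphabet}
    counts[other_key] = 0
    for line in lines:
        # upper() makes 'a' and 'A' count as the same base
        for char in line.strip().upper():
            key = char if char in alphabet else other_key
            counts[key] += 1
    return counts
```

For example, `count_bases_tolerant(["acgt-N", "AAT\n"])` counts the lowercase bases normally and files `-` and `N` under `"Gap"`. Because `lines` can be any iterable of strings, an open file object can be passed in directly.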
