bioinformatics - Binning sequence reads by GC content

Question

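I would like to "bin" (split into separate files) a multi-FASTA nucleotide sequence file (e.g. a Roche 454 run of about 500,000 reads with an average read length of 250 bp). I would like the bins to be based on the GC content of each read. The resulting output would be 8 multi-FASTA files:

<20% GC content

21-30% GC content

31-40% GC content

41-50% GC content

51-60% GC content

61-70% GC content

71-80% GC content

>80% GC content

Does anyone know of a script or program that already does this? If not, can someone suggest how to sort the multi-FASTA file by GC content (so that I can then split it into the relevant bins)?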

score 2 · Accepted Answer

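In R / Bioconductor, the tasks would be (a) load the appropriate library, (b) read the FASTA file, (c) calculate nucleotide use and GC %, (d) cut the data into bins, and (e) output the original data into separate files. Along the lines of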

## load
library(ShortRead)
## input
fa = readFasta("/path/to/fasta.fasta")
## gc content. 'abc' is a matrix, [, c("G", "C")] selects two columns
abc = alphabetFrequency(sread(fa), baseOnly=TRUE)
gc = rowSums(abc[,c("G", "C")]) / rowSums(abc)
## cut gc content into bins, with breaks at seq(0, 1, .2);
## include.lowest=TRUE keeps reads with gc == 0 in the first bin.
## For the eight bins in the question, use breaks c(0, seq(.2, .8, .1), 1)
bin = cut(gc, seq(0, 1, .2), include.lowest=TRUE)
## output, [bin==lvl] selects the reads whose 'bin' value is lvl
for (lvl in levels(bin)) {
    fileName = sprintf("%s.fasta", lvl)
    writeFasta(fa[bin==lvl], file=fileName)
}

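To get started with R / Bioconductor, see http://bioconductor.org/install. The memory requirements for 454 data of the size indicated are not too bad, and the script here will be reasonably fast (e.g., 7 seconds for 260k reads).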

score 1 · Accepted Answer

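I'd suggest using Python and Biopython, or Perl and Bioperl, to read in the FASTA file. There is a script here that calculates the GC content of a sequence in Bioperl, and Biopython has a function for it. Then simply store the GC content of each sequence in a dictionary or hash, go through each sequence, and write it to a file depending on how high its GC content is.

Do you need more specific help?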
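
As a minimal sketch of the approach described above, the following uses only the Python standard library. The small `read_fasta` parser is a hypothetical stand-in for the Biopython or Bioperl FASTA reader, and the function names and sample records are assumptions made up for illustration, not part of either library. It stores the GC content of each read in a dictionary and then orders the reads by that value:

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def gc_percent(seq):
    """GC content in percent; an empty sequence counts as 0."""
    s = seq.upper()
    return 100.0 * (s.count("G") + s.count("C")) / len(s) if s else 0.0

# Illustrative records (made up for this sketch, not real 454 data).
sample = io.StringIO(">read1\nATAT\nATAT\n>read2\nGGCC\n>read3\nATGC\n")

records = dict(read_fasta(sample))
# Dictionary of read name -> GC content, as the answer suggests.
gc_by_read = {name: gc_percent(seq) for name, seq in records.items()}
# Read names ordered from lowest to highest GC content.
by_gc = sorted(gc_by_read, key=gc_by_read.get)
```

Once the reads are ordered or keyed by GC content like this, assigning each one to an output file is a matter of comparing its value against the bin boundaries.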

score 0 · Accepted Answer

If I understood the problem correctly, you need something like the following (Python):

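def GC(seq): # determine the GC content (in percent)
    s = seq.upper()
    if not s: # avoid dividing by zero on an empty sequence
        return 0.0
    return 100.0 * (s.count('G') + s.count('C')) / len(s)

def bin(gc): # get the number of the 'bin' (1-8) for this value of GC content
    # note: this name shadows Python's built-in bin()
    if gc < 20: return 1
    elif gc > 80: return 8
    else:
        return int(gc/10) # 20 <= gc <= 80 gives bins 2 to 8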
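
One detail worth checking is where the boundary values land. With `int(gc/10)`, a read at exactly 30% goes into bin 3, whereas the labels in the question (21-30%, 31-40%, ...) suggest it belongs in the lower bin. The sketch below is an alternative rule, not the answer's method: it assumes each upper edge belongs to the lower bin and uses the standard-library `bisect` module. The names `UPPER_EDGES` and `gc_bin` are hypothetical.

```python
from bisect import bisect_left

# Upper edges of bins 1..7 in percent GC; bin 8 takes everything above 80.
UPPER_EDGES = [20, 30, 40, 50, 60, 70, 80]

def gc_bin(gc):
    """Return the 1-based bin number; a value equal to an edge goes to the lower bin."""
    return bisect_left(UPPER_EDGES, gc) + 1

# Bin numbers for a few boundary and interior values.
examples = {gc: gc_bin(gc) for gc in (0, 20, 20.5, 30, 30.1, 55, 80, 80.1, 100)}
```

Keeping the edges in one list makes the boundary convention explicit and easy to change if a different convention is wanted.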

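Then you only need to read the entries from the file, calculate the GC content, find the right bin, and write the entry to the corresponding file. The following example does this using a Python package that we use in our lab: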

from pyteomics import fasta

def split_to_bin_files(multifile):
    """Reads a file and writes the entries to appropriate 'bin' files.
    `multifile` must be a filename (str)"""

    # each entry is a (description, sequence) pair
    for entry in fasta.read(multifile):
        # assumes fasta.write appends when the output is given as a file name
        fasta.write((entry,), (multifile+'_bin_'+
                    str(bin(GC(entry[1])))))

You would then call it like this: split_to_bin_files('mybig.fasta').
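
For readers without pyteomics, the same read, bin, and write loop can be sketched with the standard library alone. This is an illustration under stated assumptions rather than a replacement for the package: `split_by_gc`, `gc_percent`, and `gc_bin` are hypothetical names, the binning rule copies the `bin()` function above, and output files are opened in write mode, so earlier results are overwritten instead of appended to. One handle per bin is kept open for the whole run rather than reopening a file for every read.

```python
import os
import tempfile

def gc_percent(seq):
    """GC content in percent; an empty sequence counts as 0."""
    s = seq.upper()
    return 100.0 * (s.count("G") + s.count("C")) / len(s) if s else 0.0

def gc_bin(gc):
    """Same rule as bin() above: 1 below 20, 8 above 80, otherwise int(gc / 10)."""
    if gc < 20:
        return 1
    if gc > 80:
        return 8
    return int(gc / 10)

def split_by_gc(path):
    """Write each record of `path` to `<path>_bin_<n>`; return the bin numbers used."""
    outputs = {}  # bin number -> open output handle

    def emit(header, seq_lines):
        number = gc_bin(gc_percent("".join(seq_lines)))
        if number not in outputs:
            # "w" mode: an existing bin file is overwritten (assumption of this sketch)
            outputs[number] = open("%s_bin_%d" % (path, number), "w")
        outputs[number].write(">%s\n%s\n" % (header, "\n".join(seq_lines)))

    try:
        with open(path) as source:
            header, seq_lines = None, []
            for line in source:
                line = line.strip()
                if line.startswith(">"):
                    if header is not None:
                        emit(header, seq_lines)
                    header, seq_lines = line[1:], []
                elif line and header is not None:
                    seq_lines.append(line)
            if header is not None:
                emit(header, seq_lines)
    finally:
        for handle in outputs.values():
            handle.close()
    return sorted(outputs)

# Small demonstration on a made-up file in a temporary directory.
workdir = tempfile.mkdtemp()
demo = os.path.join(workdir, "mybig.fasta")
with open(demo, "w") as fh:
    fh.write(">r1\nATATATAT\n>r2\nGGCC\nGGCC\n>r3\nATGC\n")
bins_written = split_by_gc(demo)
```

Only bins that actually receive a read produce a file, so a run may yield fewer than eight outputs.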
