perl - Calculating the relative content of sequences using a .fasta file

Question

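So I'm a bit of a "newbie": I was only recently introduced to programming through Perl, and I'm still getting used to it all. I have a .fasta file that I have to use, although I'm not sure whether I can open it or whether I have to work with it "blindly", so to speak.

Anyway, the file I have contains the DNA sequences of three genes, written in .fasta format.

Apparently it looks like this: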

>label
sequence
>label
sequence
>label
sequence

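My goal is to write a script that opens and reads the file, which I've gotten the hang of by now. But I also have to read each sequence, calculate the relative amount of "G" and "C" within each sequence, and then write the names of the genes and their respective "G" and "C" content to a tab-delimited file.

Could someone provide some guidance? I'm not sure what a tab-delimited file is, and I'm still trying to figure out how to open a .fasta file to actually see its contents. So far I've worked with .txt files, which I can open easily, but not with .fasta.

I apologize for sounding completely confused. I'd appreciate your patience. I'm not a professional like the rest of you!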

score 0 · Accepted Answer

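I know this is all confusing, but you really should try to limit your question to one concrete problem; see https://stackoverflow.com/faq#questions

I have no idea what a ".fasta" file or "G" and "C" are... but it probably doesn't matter.

In general:

Open the input file.
Read and parse the data. If it's in some odd format that you can't parse yourself, look on http://metacpan.org for a module that reads it. If you're lucky, someone has already done the hard part for you.
Calculate whatever it is you want to calculate.
Print to the screen (standard output) or to another file.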
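As a rough illustration of those four steps, here is a minimal Perl sketch. It assumes the simple layout shown in the question (a ">" header line followed by one or more sequence lines). The names read_fasta and base_fraction and the sample data are made up for this example and do not come from any module. To keep the sketch self-contained, it reads from an in-memory string rather than the real file.

```perl
use strict;
use warnings;

# Parse FASTA text from a filehandle into a list of [label, sequence] pairs.
# A sequence may be wrapped over several lines, so lines are concatenated
# until the next ">" header is seen.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;            # strip the newline and trailing whitespace
        next if $line eq '';
        if ($line =~ /^>(.*)/) {
            push @records, [$1, ''];   # header line: start a new record
        }
        elsif (@records) {
            $records[-1][1] .= $line;  # sequence line: append to current record
        }
    }
    return @records;
}

# Fraction of positions in $seq equal to $base (case-insensitive).
# Returns 0 for an empty sequence to avoid dividing by zero.
sub base_fraction {
    my ($seq, $base) = @_;
    return 0 unless length($seq);
    my $count = () = $seq =~ /\Q$base\E/gi;   # count all matches
    return $count / length($seq);
}

# Hypothetical sample data standing in for the real .fasta file.
my $sample = ">geneA\nGGCC\nAATT\n>geneB\nATAT\n>geneC\ngcgc\n";
open my $in, '<', \$sample or die "Cannot open in-memory sample: $!";
my @genes = read_fasta($in);
close $in;

# One tab-separated line per gene: name, G fraction, C fraction.
for my $rec (@genes) {
    my ($label, $seq) = @$rec;
    print join("\t", $label,
               sprintf('%.3f', base_fraction($seq, 'G')),
               sprintf('%.3f', base_fraction($seq, 'C'))), "\n";
}
```

To run this against an actual file, the in-memory open could be replaced with something like open my $in, '<', 'genes.fasta' (the file name is an assumption). A .fasta file is plain text, so any text editor can display it.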

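A "tab-delimited" file is a file with columns (think Excel) in which each column is separated by a tab ("\t") character, as a quick Google or Stack Overflow search would tell you.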
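To make that concrete, here is a small Perl sketch of producing tab-delimited rows. The helper name tsv_row, the column names, and the values are all hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;

# Join the fields with tab characters and end the row with a newline.
sub tsv_row {
    my @fields = @_;
    return join("\t", @fields) . "\n";
}

# Hypothetical rows: a header line followed by one row per gene.
my @rows = (
    ['gene',  'G_fraction', 'C_fraction'],
    ['geneA', '0.250',      '0.250'],
    ['geneB', '0.000',      '0.000'],
);

# Printing to standard output here. To write a file instead, open a handle
# with open(my $out, '>', 'gc_content.tsv') and print to $out
# (the file name is an assumption).
print tsv_row(@$_) for @rows;
```

A file written this way opens in a spreadsheet program with each field in its own column.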

score 0 · Accepted Answer

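Here is one way to do it with the "awk" utility, which can be run from the command line. Save the program below to a file and run it by giving its path: awk -f <path> <sequence file>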

# A header line starts with ">": print the finished record, then reset the counters.
/^>/ {
    if (seen) report()
    name = substr($0, 2); seen = 1
    g = c = total = 0
    next
}
# Every other line is sequence data (a sequence may span several lines).
{
    for (i = 1; i <= length($0); i++) {
        # uppercase each base, since some fasta files use lowercase letters
        base = toupper(substr($0, i, 1))
        total++
        if (base == "C")
            c++
        else if (base == "G")
            g++
    }
}
# Print the last record once the input is exhausted.
END { if (seen) report() }

# Prints the gene name and the G and C content in percent, separated by tabs.
function report() {
    printf "%s\tG content:\t%.2f%%\tC content:\t%.2f%%\n", name, pct(g), pct(c)
}
# Percentage of the current record's bases; 0 if the record is empty.
function pct(n) { return total ? 100 * n / total : 0 }
