perl - Perl program prints the complete .fasta file sequence to a file, but I'm trying to get gene-specific nucleotide counts

Question

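So far I've managed to learn a good deal more about Perl, which is a relief, and I want to thank you all. I'm still working on another part: I need to read a .fasta file, find all the G and C nucleotides, and then create a tab-delimited file.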

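These are my posts from the past few days, in chronological order:

How to average column values from tab-delimited data... (solved)
Why can't I see the calculated results in my output file? (solved)
Using a .fasta file to compute the relative content of sequences
Reading .fasta sequences to extract nucleotide data, and then... (posted before this one)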

The last query is still in progress, but I've made some headway.

For some background, the contents of a .fasta file look like this:

>label
sequence
>label
sequence
>label
sequence

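I'm not sure how to open a .fasta file, so I don't know which label goes with which sequence, but I do know the genes should be labeled gag, pol, or env. Do I need to open the .fasta file to know what I'm doing, or can I work "blind" from the format above?

Anyway, my current code is as follows: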

#!/usr/bin/perl -w
# This script reads several sequences and computes the relative content of G+C of each sequence.

use strict; 

my $infile = "Lab1_seq.fasta";                               # This is the file path
open INFILE, $infile or die "Can't open $infile: $!";        # This opens file, but if file isn't there it mentions this will not open
my $outfile = "Lab1_SeqOutput.txt";             # This is the file's output
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!"; # This opens the output file, otherwise it mentions this will not open

my $sequence = ();  # This sequence variable stores the sequences from the .fasta file
my $GC = 0;         # This variable checks for G + C content

my $line;                             # This reads the input file one-line-at-a-time

while ($line = <INFILE>) {
    chomp $line;                      # This removes "\n" at the end of each line (this is invisible)

    if($line =~ /^\s*$/) {         # This finds lines with whitespaces from the beginning to the ending of the sequence. Removes blank line.
        next;

    } elsif($line =~ /^\s*#/) {        # This finds lines with spaces before the hash character. Removes .fasta comment
        next; 
    } elsif($line =~ /^>/) {           # This finds lines with the '>' symbol at beginning of label. Removes .fasta label
        next;
    } else {
        $sequence = $line;
    }

    $sequence =~ s/\s//g;               # Whitespace characters are removed
    print OUTFILE $sequence;
}

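The code now prints the whole sequence to the text file with no whitespace. The only problem is that I can't tell where each sequence starts or ends, so I'm not sure which sequence belongs to which gene, although the stop/start codons should give me an indication. With that in mind, how would I modify or add to the code to count the G+C in each sequence and then print the counts to a tab-delimited file that pairs each gene name with its G/C content?

I look forward to hearing from anyone who can offer guidance, along the lines of the code posted above, on how to find the G/C and then tabulate the respective counts.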

score 2 · Accepted Answer

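The following links may be helpful. A lot of code has already been written, and Bio::SeqIO and Bio::Seq seem to come up often in discussions. BioPerl has a website, but I'm not familiar with it. It has code examples and other information, and the FAQ is also helpful.

This is an example from the Bio::SeqIO documentation.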

use Bio::SeqIO;

$in  = Bio::SeqIO->new(-file => "inputfilename" ,
                       -format => 'Fasta');
$out = Bio::SeqIO->new(-file => ">outputfilename" ,
                       -format => 'EMBL');

while ( my $seq = $in->next_seq() ) {
        $out->write_seq($seq);
}

score 1 · Accepted Answer

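I actually work with FASTA files myself, so I feel your pain.

To answer your repeated question about which label applies to which sequence: if the file is well-formed, each label that precedes sequence data applies to the sequence that follows it. So you should parse the file from top to bottom like this: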

>label1
sequence1
>label2
sequence2
>label3
sequence3
...

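...where each label indicates that new sequence information follows. You'll also want to ignore lines that begin with a semicolon (;), since those denote legacy comment fields.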

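Otherwise, it looks like you're removing whitespace correctly when you rearrange the file. I'd suggest keeping the label fields intact, separated by newlines, so that your output file looks like the format mentioned above, with comments and whitespace removed.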
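As a rough illustration of that rearranged file, here is a minimal Perl sketch. The helper name normalize_fasta and the sample records are hypothetical, and it assumes that sequence text appearing before the first label can be dropped; treat it as one possible approach rather than a definitive implementation.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): takes raw FASTA lines and
# returns cleaned lines, one label line followed by one sequence line per record.
sub normalize_fasta {
    my @lines = @_;                    # work on a copy of the input lines
    my @out;
    my $seq        = '';
    my $have_label = 0;

    for my $line (@lines) {
        $line =~ s/\r?\n\z//;          # strip the line ending (LF or CRLF)
        next if $line =~ /^\s*$/;      # skip blank lines
        next if $line =~ /^;/;         # skip legacy ';' comment lines

        if ($line =~ /^>/) {
            push @out, $seq if $have_label;   # flush the previous record
            push @out, $line;                 # keep the label line intact
            $seq        = '';                 # assumption: unlabeled leading text is dropped
            $have_label = 1;
        } else {
            $line =~ s/\s+//g;         # remove whitespace inside the sequence
            $seq .= $line;             # join wrapped sequence lines
        }
    }
    push @out, $seq if $have_label;    # flush the last record
    return @out;
}

# Made-up sample data, for illustration only.
my @clean = normalize_fasta(
    ">gag\n", "; legacy comment\n", "ATG GCA\n", "\n", "CCGT\n",
    ">pol\n", "GGCC\n",
);
print "$_\n" for @clean;
# Prints:
# >gag
# ATGGCACCGT
# >pol
# GGCC
```

To run it on a real file, the lines could be read from a filehandle and passed in, for example `normalize_fasta(<$fh>)` after opening the file with a three-argument open.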

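Once you have that, it's simple: iterate through the rearranged file, grab the sequence pieces you need, and restart the counter whenever you hit a new label.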
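The counter-reset idea can be sketched in Perl roughly as follows. The helper name gc_per_record, the sample sequences, and the extra G+C fraction column are assumptions made for illustration; counting lowercase g and c as well is also a choice of this sketch, not something the post specifies.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns one
# [label, G+C count, sequence length] triple per FASTA record.
sub gc_per_record {
    my @lines = @_;                    # work on a copy of the input lines
    my @records;
    my $label;
    my ($gc, $len) = (0, 0);

    for my $line (@lines) {
        $line =~ s/\r?\n\z//;                          # strip the line ending
        next if $line =~ /^\s*$/ || $line =~ /^;/;     # skip blanks and ';' comments

        if ($line =~ /^>\s*(.*)/) {
            # New label: store the finished record, then restart the counters.
            push @records, [$label, $gc, $len] if defined $label;
            ($label, $gc, $len) = ($1, 0, 0);
        } else {
            $line =~ s/\s+//g;                         # remove embedded whitespace
            $gc  += ($line =~ tr/GCgc//);              # tr/// counts without modifying
            $len += length $line;
        }
    }
    push @records, [$label, $gc, $len] if defined $label;   # last record
    return @records;
}

# Made-up sample data, for illustration only.
my @demo = (">gag\n", "ATGGCA\n", "CCGT\n", ">pol\n", "ggcc\n", ">env\n", "ATAT\n");

for my $rec (gc_per_record(@demo)) {
    my ($name, $count, $length) = @$rec;
    # Tab-delimited: gene name, G+C count, G+C fraction.
    printf "%s\t%d\t%.2f\n", $name, $count, $length ? $count / $length : 0;
}
# Prints (tab-separated):
# gag	6	0.60
# pol	4	1.00
# env	0	0.00
```

To produce the tab-delimited file from the question, the same printf could be directed at an output filehandle instead of standard output.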
