perl - Reading .fasta sequences to extract nucleotide data, then writing to a tab-delimited file

Question

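Before I go on, I should point readers to the earlier problems I've had with Perl, since I'm a beginner at all of this.

These are my posts from the past few days, in chronological order:

As I said above, thanks to help from some of you, I have managed to work out the first two queries, and I genuinely learned from them. I really appreciate it. For someone who knows next to nothing about this, and still feels that way, that help has been a godsend.

The last query is still unresolved, and this is a continuation of it. I did look at some of the recommended texts, but since I'm trying to finish this before Monday, I'm not sure whether I've overlooked something entirely. Either way, I have made an attempt at the task.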

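As you know, the task is to open and read a .fasta file (I think I've finally nailed that part, hallelujah!), read each sequence, compute its relative G+C nucleotide content, and then write the gene names and their respective G+C content to a tab-delimited file.

Although I've tried, I know my program isn't yet ready to produce the result I'm after, which is why I'm reaching out to you again for some guidance, or an example of how to go about it. As with my earlier solved queries, I'd like it to be in a similar style to what I've already written, even if that isn't the most convenient or efficient way. It just lets me see what I'm doing at each step, even if it looks like I'm spamming it with comments!

Anyway, the .fasta file is laid out as follows: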

>label
sequence
>label
sequence
>label
sequence

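I'm not sure how to open the .fasta file to look inside it, so I don't know which label goes with which sequence, but I do know the genes should be labelled gag, pol, or env. Do I need to open the .fasta file to know what I'm doing, or can I do this "blind" using the format above?

This may be very obvious, but I'm still working through all of this. I feel I should have caught up by now!

Anyway, my current code is as follows: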

#!/usr/bin/perl -w
# This script reads several sequences and computes the relative content of G+C of each sequence.
use strict; 

my $infile = "Lab1_seq.fasta";                               # This is the file path
open INFILE, $infile or die "Can't open $infile: $!";        # This opens file, but if file isn't there it mentions this will not open
my $outfile = "Lab1_SeqOutput.txt";             # This is the file's output
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!"; # This opens the output file, otherwise it mentions this will not open

my $sequence = ();  # This sequence variable stores the sequences from the .fasta file
my $GC = 0;         # This variable checks for G + C content

my $line;                             # This reads the input file one-line-at-a-time
while ($line = <INFILE>) {
    chomp $line;                      # This removes "\n" at the end of each line (this is invisible)

    foreach my $line ($infile) {
        if($line = ~/^\s*$/) {         # This finds lines with whitespaces from the beginning to the ending of the sequence. Removes blank line.
            next;
        } elsif($line = ~/^\s*#/) {        # This finds lines with spaces before the hash character. Removes .fasta comment
            next; 
        } elsif($line = ~/^>/) {           # This finds lines with the '>' symbol at beginning of label. Removes .fasta label
            next;
        } else {
            $sequence = $line;
        }
    }
    {
        $sequence =~ s/\s//g;               # Whitespace characters are removed
        return $sequence;
    }

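I'm not sure whether anything else is wrong here, but running it gives me a syntax error at line 35 (which is past the last line, so there's nothing there!). It says 'at EOF'. That's all I can point to. Beyond that, I'm trying to work out how to count the G + C nucleotides in each sequence and then tabulate them correctly in the output .txt file. I believe that's what a tab-delimited file means?

Anyway, I apologise if this query seems too long, "stupid", or repetitive, but I couldn't find anything directly related to it, so any help would be much appreciated, along with an explanation of each step if possible!

Kind regards.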

score 2 · Accepted Answer

You have an extra brace near the end. This should work:

#!/usr/bin/perl -w
# This script reads several sequences and computes the relative content of G+C of each sequence.

use strict; 

my $infile = "Lab1_seq.fasta";                               # This is the file path
open INFILE, $infile or die "Can't open $infile: $!";        # This opens file, but if file isn't there it mentions this will not open
my $outfile = "Lab1_SeqOutput.txt";             # This is the file's output
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!"; # This opens the output file, otherwise it mentions this will not open

my $sequence = ();  # This sequence variable stores the sequences from the .fasta file
my $GC = 0;         # This variable checks for G + C content

my $line;                             # This reads the input file one-line-at-a-time

while ($line = <INFILE>) {
    chomp $line;                      # This removes "\n" at the end of each line (this is invisible)

    if($line =~ /^\s*$/) {         # This finds lines with whitespaces from the beginning to the ending of the sequence. Removes blank line.
        next;

    } elsif($line =~ /^\s*#/) {        # This finds lines with spaces before the hash character. Removes .fasta comment
        next; 
    } elsif($line =~ /^>/) {           # This finds lines with the '>' symbol at beginning of label. Removes .fasta label
        next;
    } else {
        $sequence = $line;
    }

    $sequence =~ s/\s//g;               # Whitespace characters are removed
    print OUTFILE $sequence;
}

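I also edited your return line. return will exit your loop. I suspect what you wanted was to print the sequence to the file, so that's what I've done. You may need to do some further transformation first to get it into tab-delimited format.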
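For the remaining transformation, here is a minimal sketch of one way to get from FASTA records to a tab-delimited table of gene names and G+C content. It is not the asker's original approach: the helper names gc_fraction and write_gc_table, the "Gene" and "GC_content" column headers, and the sample sequences are all made up for illustration, and it assumes the label after '>' is the gene name (gag, pol, env) and that a sequence may span several lines. It reads from an in-memory string so it can be run without the real Lab1_seq.fasta.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fraction of G and C bases in a sequence (0 for an empty sequence).
sub gc_fraction {
    my ($seq) = @_;
    my $length = length $seq;
    return 0 unless $length;
    my $gc = ($seq =~ tr/GCgc//);        # tr/// with an empty replacement only counts
    return $gc / $length;
}

# Read FASTA records from $in and write "label<TAB>G+C" lines to $out.
sub write_gc_table {
    my ($in, $out) = @_;
    my ($label, $sequence) = (undef, '');

    # Helper: emit the record collected so far, if there is one.
    my $flush = sub {
        return unless defined $label;
        printf {$out} "%s\t%.3f\n", $label, gc_fraction($sequence);
    };

    while (my $line = <$in>) {
        chomp $line;
        next if $line =~ /^\s*$/;        # skip blank lines
        if ($line =~ /^>\s*(.*?)\s*$/) { # a '>' line starts a new record
            $flush->();
            ($label, $sequence) = ($1, '');
        } else {
            $line =~ s/\s//g;            # drop stray whitespace
            $sequence .= $line;          # sequences may span several lines
        }
    }
    $flush->();                          # do not forget the last record
}

# Demo on in-memory data (made-up sequences, not the real Lab1_seq.fasta).
my $fasta = ">gag\nATGC\nGGCC\n>pol\nAATT\n>env\nGCGCAT\n";
my $table = '';
open my $in,  '<', \$fasta or die "Cannot read in-memory FASTA: $!";
open my $out, '>', \$table or die "Cannot write in-memory table: $!";
print {$out} "Gene\tGC_content\n";       # header row (hypothetical column names)
write_gc_table($in, $out);
close $out;
close $in;
print $table;
```

To run it on real files, the two in-memory opens could be replaced with three-argument opens on "Lab1_seq.fasta" for reading and "Lab1_SeqOutput.txt" for writing. The key difference from the loop above is that the label is kept rather than skipped, and the G+C fraction is only computed once the whole sequence for that label has been collected.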
