
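I'm having some trouble working with an array of DNA sequence data in .fasta format. Specifically, I want to take a file containing a few thousand sequences and join the sequence data of each sequence onto a single line. [The FASTA format goes like this: a sequence ID line starts with >, and everything after it on that line is a description. The next line holds the sequence that corresponds to this ID. The sequence can continue over any number of lines until the next line starting with >, which is the ID of the next sequence in the file.] In my particular file, most of the sequences are spread over multiple lines, so what I basically want to do is remove the newlines, but only the newlines between lines of sequence data, not the newlines between the sequence data and the sequence ID lines (the ones starting with >).

I'm doing this because I want to be able to get the length of each sequence (with length, which I believe is the easiest way), and then get the average sequence length over all the sequences in the file.

Here is my script so far, which doesn't seem to want to work: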

#!/usr/bin/perl -w


##Subroutine
sub get_file_data1 { 
    my($filename) = $_[0];
    my @filedata = ();
    unless( open(GET_FILE_DATA, $filename)) {
    print STDERR "Cannot open file \"$filename\"\n\n";
    exit;
    }
    @filedata = <GET_FILE_DATA>;
    close GET_FILE_DATA;
    return @filedata;
}



##Opening files
my $fsafile = $ARGV[0];
my @filedata = &get_file_data1($fsafile);


##Procedure
my @count;
my @ids;
my $seq;

foreach $seq (@filedata){
        if ($seq =~ /^>/) {push @ids, $seq;
                                 push @count, "\n";
    }
        else {push @count, $seq;
    }
}


foreach my $line (@count) {
    if ($line =~ /^[AGTCagtc]/){
         $line =~ s/^([AGTCagtc]*)\n/$1/;
    }
}

##Make a text file to have a look
open FILE3, "> unbrokenseq.txt" or die "Cannot open output.txt: $!";

foreach (@count)
{
    print FILE3 "$_\n"; # Print each entry in our array to the file
}
close FILE3;


__END__
##Creating array of lengths
my $number;
my @numberarray;
foreach $number (@count) {
                push @numberarray, length($number);
                }
print @numberarray;


__END__
use List::Util qw(sum);

sub mean {
    return sum(@numberarray)/@numberarray;
}

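There is something wrong with the second foreach loop in the Procedure section, and I can't figure out what it is. Note that I haven't even tried the code after the __END__ lines yet, because I can't get the code in the Procedure step to do what I want. Any idea how I can get a nice array whose elements are complete sequences (I chose to simply leave the sequence ID lines out of the new array)? Once I have that, I can get an array of lengths and then average them.

Finally, I'm sorry to admit that I can't get Bio::Perl working on my computer. I've tried for hours, but the errors are beyond my ability to fix. I'll be talking to someone who can hopefully help me with my Bio::Perl problem, but for now I just need to press on without it.

Thanks! Sorry for the length of this post; I appreciate the help.

Andrew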


3 Answers

0

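The problem with your second loop is not that nothing changes: in a foreach loop $line is an alias for each element of @count, so the substitution does modify the array. But all it does is strip the trailing newline from each element. The lines of a sequence are still separate array elements, and your print loop then adds a "\n" back after each one, so the output file looks unchanged.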

But if all you want to do in the second loop is remove the newline character at the end, use the chomp function. With it you wouldn't need your second loop at all. (It would also be faster than using the regex.)

# remove newlines for all array elements before doing anything else with it
chomp @filedata;

# .. or you can do it in your first loop
foreach $seq (@filedata){
    chomp $seq;
    if ($seq =~ /^>/) {
    ...
}

An additional tip: using get_file_data1 to read the entire file into an array can be slow and memory-hungry if your files are large. In that case it is better to process the file line by line as you read it:

open my $FILE_DATA, '<', $filename or die "Cannot open file \"$filename\": $!\n";
while (my $line = <$FILE_DATA>) {
    chomp $line;
    # process the record as in your Procedure section
    ...
}
close $FILE_DATA;
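
Putting the pieces together, here is a minimal sketch of the line-by-line approach. It assumes every record starts with a '>' header line and that blank lines can be ignored. The sub name read_fasta_lengths and the sample data are made up for illustration; in a real script you would open your .fasta file instead of the in-memory string.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: read FASTA records from an open filehandle and
# return (\@ids, \@lengths), with one entry per record.
sub read_fasta_lengths {
    my ($fh) = @_;
    my (@ids, @seqs);
    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;        # strip trailing newline (also handles \r\n)
        next if $line eq '';       # skip blank lines
        if ($line =~ /^>(.*)/) {
            push @ids,  $1;        # a header line starts a new record
            push @seqs, '';
        }
        elsif (@seqs) {
            $seqs[-1] .= $line;    # append to the current record's sequence
        }
    }
    return (\@ids, [ map { length($_) } @seqs ]);
}

# Demo on an in-memory "file" (the sample data is made up).
my $fasta = ">seq1 first\nACGT\nAC\n>seq2 second\nGGGTTTAA\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
my ($ids, $lengths) = read_fasta_lengths($fh);
close $fh;

# Guard against an empty file before dividing.
my $mean = @$lengths ? sum(@$lengths) / @$lengths : 0;
print "$ids->[$_]\t$lengths->[$_]\n" for 0 .. $#$ids;
print "mean length: $mean\n";
```

Because each sequence is built up as a single string, there is no need to write an intermediate "unbroken" file just to measure lengths.
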
answered 2012-05-11T02:10:53.917
-1

Be careful with the '*' quantifier on your character class in s///. You usually want '+' instead: '*' also matches when none of your characters are present.

A substitution with the 'g' modifier returns the number of replacements it made, so it can also count characters. Like this:

$ perl -e '$a="aggaacaat"; $b = $a =~ s/[a]//g; print $b; '
5

Pretty cool, huh! Alternatively, in your code, you could just call length() on $1.

I was taken aback to see the escaped '\n' in your regex. While it works fine, the common end-of-line anchor is '$'. This is more portable and doesn't mess up your character counts.
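
To make the counting and end-of-line points concrete, here is a small sketch (the variable names and sample string are made up). It counts characters with tr///, which unlike s///g leaves the string unchanged, and shows that '$' matches before a trailing newline without removing it, whereas chomp actually removes it.

```perl
use strict;
use warnings;

my $seq = "aggaacaat\n";

# tr/// returns the number of matching characters; with an empty
# replacement list and no /d flag it does not modify the string.
my $a_count = ($seq =~ tr/a//);

# '$' is a zero-width anchor: the match succeeds, but the newline stays.
my $matched    = ($seq =~ /^[acgt]+$/) ? 1 : 0;
my $len_before = length $seq;    # still includes the "\n"

chomp $seq;                      # removes the trailing newline
my $len_after = length $seq;

print "a count: $a_count\n";
print "matched: $matched\n";
print "length before chomp: $len_before, after: $len_after\n";
```

So if the goal is an accurate sequence length, chomp the line first and then call length on it.
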

answered 2012-05-11T02:09:13.977
-1

Your regex explicitly captures into $1, but you are printing $_ to the file. The result is most likely not what you want.

answered 2012-05-10T23:34:24.557