向全世界的 Perl 大师问好。
我在编程方面遇到了另一个麻烦。我正在编写一个程序,该程序从具有特定输入数的蛋白质 fasta 文件中选择随机序列。
一般的 fasta 文件如下所示:
>seq_ID_1 描述等 ASDGDSAHSAHASDFRHGSDHSDGEWTSHSDHDSHFSDGSGASGADGHHAH ASDSADGDASHDASHSAREWAWGDASHASGASGASGSDGASDGDSAHSHAS SFASGDASGDSSDFDSFSDFSD
>seq_ID_2 描述等 ASDGDSAHSAHASDFRHGSDHSDGEWTSHSDHDSHFSDGSGASGADGHHAH ASDSADGDASHDASHSAREWAWGDASHASGASGASG
等等.......
字母代表氨基酸肽。
所以我有一个包含 1000 个序列的 fasta 文件,想要检索其中的 63.21%,即 632.1 个序列。但是序列不能是浮点数,所以如果它超过 0.5 我想向上取整,如果小于 0.5 向下取整。
这是我生成随机序列子集的代码,但它不太擅长工作。
#!/usr/bin/perl
#Selecting 63.21% of random sequnces from a proteom file.
use strict;
use warnings;
use List::Util qw(shuffle);
#Give the first argument as a proteom file.
if (@ARGV != 1)
{
print "Invalid arguments\n";
print "Usage: perl randseq.pl [proteom_file]";
exit(0);
}
my $FILE = $ARGV[0];
my $i = 0;
my %protseq = {};
my $nIdx = 0;
#Extraction and counting of the all headers from a proteom file.
open(LIST,$FILE);
open(TEMP1, ">temp1");
while (my $line = <LIST>){
chomp $line;
if ($line =~ />(\S+) (.+)/){
$i++;
print TEMP1 $1,"\n";
}
}
close(LIST);
close(TEMP1);
#Selection of random headers for generating a random subset of the proteom file.
my $GET_LINES = RoundToInt ($i*0.6321);
my @line_starts;
open(my $FH,'<','temp1');
open(TEMP2, ">temp2");
do {
push @line_starts, tell $FH
} while ( <$FH> );
my $count = @line_starts;
my @shuffled_starts = (shuffle @line_starts)[1..$GET_LINES+1];
for my $start ( @shuffled_starts ) {
seek $FH, $start, 0
or die "Unable to seek to line - $!\n";
print TEMP2 scalar <$FH>;
}
close(TEMP2);
#Assigning the sequence data to randomly generated header file.
open(DATA,'<','temp2');
while(my $line = <DATA>)
{
chomp($line);
$line =~ s/[\t\s]//g;
if($line =~ /^([^\s]+)/)
{
$protseq{$1}++;
}
}
close(DATA);
open(DATA, "$FILE");
open(OUT, ">random_seqs.fasta");
while(my $line = <DATA>)
{
chomp($line);
if($line =~ /^>([^\s]+)/)
{
if($protseq{$1} ne "")
{
$nIdx = 1;
print OUT "$line\n";
}
else
{
$nIdx = 0;
}
}
else
{
if($nIdx == 1)
{
print OUT "$line\n";
}
}
}
close(DATA);
close(OUT);
#subroutine for rounding
sub RoundToInt {
int($_[0] + .5 * ($_[0] <=> 0));
}
system("erase temp1");
system("erase temp2");
exit;
但是,它有时会给出适当数量的序列,有时会给出一个更多的序列。我怎样才能摆脱它......请有什么想法吗?
或者也许更好的短代码?
在这里你可以得到一个 75 酵母蛋白质组文件。[http://www.peroxisomedb.org/Download/Saccharomyces_cerevisiae.fas][1]
希望我能尽快解决这个问题...... :(