-1

我编写了一个 PERL 程序,它采用一个 excel 表(通过将扩展名从 .xls 更改为 .txt 来转换为一个文本文件)和一个序列文件作为其输入。excel 表包含序列文件中需要剪切并提取到第三个输出文件中的区域的起点和终点(以及匹配区域两侧的 70 个侧翼值)。大约有 300 个值。该程序读取每次需要剪切的序列的起点和终点,但它反复告诉我该值超出了输入文件的长度,而显然不是。我似乎无法解决这个问题

这是程序

use strict;
use warnings;

my $blast;
my $i;
my $idline;
my $sequence;
print "Enter Your BLAST result file name:\t";
chomp( $blast = <STDIN> );    # BLAST result file name
print "\n";

my $database;
print "Enter Your Gene list file name:\t";
chomp( $database = <STDIN> );    # sequence file
print "\n";

open IN, "$blast" or die "Can not open file $blast: $!";

my @ids       = ();
my @seq_start = ();
my @seq_end   = ();

while (<IN>) {

    #spliting the result file based on each tab
    my @feilds = split( "\t", $_ );
    push( @ids, $feilds[0] );    #copying the name of sequence
         #coping the 6th tab value of the result which is the start point of from where a value should be cut.
    push( @seq_start, $feilds[6] );
    #coping the 7th tab value of the result file which is the end point of a value should be cut.
    push( @seq_end, $feilds[7] );
}
close IN;

open OUT, ">Result.fasta" or die "Can not open file $database: $!";

for ( $i = 0; $i <= $#ids; $i++ ) {

    ($sequence) = &block( $ids[$i] );

    ( $idline, $sequence ) = split( "\n", $sequence );

    #extracting the sequence from the start point to the end point
    my $seqlen = $seq_end[$i] - $seq_start[$i] - 1;

    my $Nucleotides = substr( $sequence, $seq_start[$i], $seqlen );  #storing the extracted substring into $sequence

    $Nucleotides =~ s/(.{1,60})/$1\n/gs;

    print OUT "$idline\n";
    print OUT "$Nucleotides\n";
}
print "\nExtraction Completed...";

sub block {
    #block for id storage which is the first tab in the Blast output file.
    my $id1 = shift;
    print "$id1\n";
    my $start = ();

    open IN3, "$database" or die "Can not open file $database: $!";

    my $blockseq = "";
    while (<IN3>) {

        if ( ( $_ =~ /^>/ ) && ($start) ) {

            last;
        }

        if ( ( $_ !~ /^>/ ) && ($start) ) {

            chomp;
            $blockseq .= $_;
        }

        if (/^>$id1/) {

            my $start = $. - 1;
            my $blockseq .= $_;
        }
    }
    close IN3;

    return ($blockseq);
}

爆炸结果文件:http ://www.fileswap.com/dl/Ws7ehftejp/

序列文件:http ://www.fileswap.com/dl/lPwuGh2oKM/

错误

substr 在 Nucleotide_Extractor.pl 第 39 行的字符串之外。

在 Nucleotide_Extractor.pl 第 41 行使用未初始化值 $Nucleotides 替换 (s///)。

在连接 (.) 或 Nucleotide_Extractor.pl 第 44 行的字符串中使用未初始化的值 $Nucleotides。

非常感谢您的任何帮助,并始终邀请您提出疑问

4

1 回答 1

2

现有代码存在几个问题,我最终在修复错误的同时重写了脚本。您的实现效率不高,因为它打开、读取和关闭 Excel 工作表中每个 ID 的序列文件。更好的方法是从序列文件中读取和存储数据,或者,如果内存有限,则遍历序列文件中的每个条目并从 Excel 文件中挑选出相应的数据。最好使用散列而不是数组;散列将数据存储在键值对中,因此更容易找到您要查找的内容。我在整个过程中也使用了引用,因为它们可以很容易地将数据传入和传出子例程。

如果您不熟悉 perl 数据结构,请查看perlfaq4perldsc,并且perlreftut有关于使用引用的信息。

您现有代码的主要问题是从 fasta 文件中获取序列的子例程没有返回任何内容。在您的代码中放置大量调试语句以确保它正在执行您认为它正在执行的操作是一个好主意。我保留了调试语句,但将它们注释掉了。我还大量评论了我更改的代码。

#!/usr/bin/perl
use strict;
use warnings;
# enables 'say', which prints out your text and adds a carriage return
use feature ':5.10';
# a very useful module for dumping out data structures
use Data::Dumper;

#my $blast = 'infesmall.txt';
print "Enter Your BLAST result file name:\t";
chomp($blast = <STDIN>);     # BLAST result file name
print "\n";

#my $database = 'infe.fasta';
print "Enter Your Gene list file name:\t";
chomp($database = <STDIN>);  # sequence file
print "\n";

open IN,"$blast" or die "Can not open file $blast: $!";

# instead of using three arrays, let's use a hash reference!
# for each ID, we want to store the start and the end point. To do that,
# we'll use a hash of hashes. The start and end information will be in one
# hash reference:
# { start => $fields[6], end => $fields[7] }
# and we will use that hashref as the value in another hash, where the key is
# the ID, $fields[0]. This means we can access the start or end data using
# code like this:
#   $info->{$id}{start}
#   $info->{$id}{end}
my $info;

while(<IN>){
    #splitting the result file based on each tab
    my @fields = split("\t",$_);
    # add the data to our $info hashref with the ID as the key:
    $info->{ $fields[0] } = { start => $fields[6], end => $fields[7] };
}
close IN;

#say "info: " . Dumper($info);

# now read the sequence info from the fasta file
my $sequence = read_sequences($database);
#say "data from read_sequences:\n" . Dumper($sequence);

my $out = 'result.fasta';
open(OUT, ">" . $out) or die "Can not open file $out: $!";

foreach my $id (keys %$info) {

    # check whether the sequence exists
    if ($sequence->{$id}) {
        #extracting the sequence from the start point to the end point
        my $seqlen = $info->{$id}{end} - $info->{$id}{start} - 1;

        #say "seqlen: $seqlen; stored seq length: " . length($sequence->{$id}{seq}) . "; start: " . $info->{$id}{start} . "; end: " . $info->{$id}{end};

        #storing the extracted substring into $sequence
        my $nucleotides = substr($sequence->{$id}{seq}, $info->{$id}{start}, $seqlen);
        $nucleotides =~ s/(.{1,60})/$1\n/gs;
        #say "nucleotides: $nucleotides";
        print OUT $sequence->{$id}{header} . "\n";
        print OUT "$nucleotides\n";
    }
}
print "\nExtraction Completed...";

sub read_sequences {
    # fasta file
    my $fasta_file = shift;

    open IN3, "$fasta_file" or die "Can not open file $fasta_file: $!";

    # initialise two variables. We will store our sequence data in $fasta
    # and use $id to track the current sequence ID
    # the $fasta hash will look like this:
    # $fasta = {
    #   'gi|7212472|ref|NC_002387.2' => {
    #       header => '>gi|7212472|ref|NC_002387.2| Phytophthora...',
    #       seq => 'ATAAAATAATATGAATAAATTAAAACCAAGAAATAAAATATGTT...',
    #   }
    #}

    my ($fasta, $id);

    while(<IN3>){
        chomp;
        if (/^>/) {
            if (/^>(\S+) /){
                # the header line with the sequence info.
                $id = $1;
                # save the data to the $fasta hash, keyed by seq ID
                # we're going to build up an entry as we go along
                # set the header to the current line
                $fasta->{ $id }{ header } = $_;
            }
            else {
                # no ID found! Erk. Emit an error and undef $id.
                warn "Formatting error: $_";
                undef $id;
            }
        }
        ## ensure we're getting sequence lines...
        elsif (/^[ATGC]/) {
            # if $id is not defined, there's something weird going on, so
            # don't save the sequence. In a correctly-formatted file, this
            # should not be an issue.
            if ($id) {
                # if $id is set, add the line to the sequence.
                $fasta->{ $id }{ seq } .= $_;
            }
        }
    }
    close IN3;
    return $fasta;
}
于 2014-09-19T16:10:14.057 回答