perl - 基于位置提取fasta序列

Question

我是perl的新手。还在学习。

我有一个fasta格式的文件。我想提取跨越特定位置的序列。例如，从位置 200 到 300

>Contig[0001]
TGCATCAAAAGCTGAAAATATGTAGTCGAGAAGTCATTTCGAGAAATTGACGTTTTAAGT
TTCGGTTTCCAAATTCAACCGGATGTATCTTCGCCAATAATTGTCAGCAGTTAGAATTTC
TTTCAACATTATGAAGCCCTTTTTATATATTTTGATTCTGCATCAAAAGCTGAAAATATG
TAGTCTTGAAGTCATTTCGAGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGG
ATGTATCTTCGCCAATAATTGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTT
TTATATATTTTGATTCTGCATCAAAAGCTGAAAATGTGTAGTCTCGAAGTCATTTCGAGA
AATTGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAATTGT
CAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTTTTACATATTTTGACCCTGCATC
AAAAGCTGAAAATATGTAGTCTCGAAGTCATTTTGAGAAGTTAGAATTTCTTTCAACATT
ATGAAGCCCTTTTTATATATTTTGATTCTGCATCAAAAGCTGAAAATATGTAGTCTCGAA
GTCWTTTCRAGAAATTGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTC
GCCAATAATTGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTTTTATATATTT
TGACTCTGCATCAAAAGCTGAAAATATGTAGTCTCGAAGTCATTTCGAGAAATTGACGTT

我想从序列中提取位置 200-300 的序列Contig[0001]。输出将是：

>Contig[0001]_200-300
AGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAATT
GTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTT

我的 fasta 文件中有近 500 个序列，并且在包含 id start end 的制表符分隔文件中有所需的位置。

如果有人能在这方面帮助我，那就太好了。

非常感谢您的所有帮助。我不确定我是否可以提供包含有关职位信息的文件。

新手

score 3 · Accepted Answer

单程。内容script.pl：

#!/usr/bin/env perl

use warnings;
use strict;

my ($adn, $l, $header);

while ( <> ) { 
    chomp;

    ## First line is known, a header, so print it and process next one.
    if ( $. == 1 ) { 
        printf qq|%s_%s\n|, $_, q|200-300|;
        next;
    }   

    ## Concat adn while not found a header.
    if ( '>' ne substr $_, 0, 1 ) { 
        if ( ! $l ) { $l = length }
        $adn .= $_; 
        if ( ! eof ) { next }
    }   
    else {
        $header = sprintf qq|%s_%s\n|, $_, q|200-300|;
    }   

    ## Extract range 200-300 and insert newlines to set same length of 
    ## line than before.
    my $s = substr $adn, 199, 100;
    $s =~ s/(.{$l})/$1\n/g;
    printf qq|%s\n|, $s; 
    undef $adn;

    ## If not end of file, print the header of the following adn.
    if ( ! eof ) { print $header }
}

像这样运行它：

perl script.pl infile

这会产生：

>Contig[0001]_200-300
AGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAAT
TGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTT

score 2 · Accepted Answer

我假设您知道如何将其放入您的程序中。看看substr功能。它做你想做的事。

substr EXPR,OFFSET,LENGTH

所以基本上你需要做的是：

my $snippet = substr($sequence, 200, 100);

再想一想，看看 CPAN：有一个名为Bio::SeqReader::Fasta的模块，您可以使用它来读取文件并获取序列。在我看来，它有很好的记录，我对此很感兴趣。所以这里有一个解决方案。

use strict;
use warnings;
use feature qw(say);
use Bio::SeqReader::Fasta;

# Put your positions here (start, end)
my @positions = (
  [ 200, 300 ], 
  [ 50, 180 ],
);

open my $fh, '<', '/path/to/file.fasta' or die $!;
my $in = new Bio::SeqReader::Fasta( fh => $fh ); # create the B::SR::F object

my $i = 0;
while ( my $so = $in->next() ) {
  # $so represents one dataset and is a Bio::SeqReader::FastaRecord

  say substr($so->seq(),               # we want a part of the sequence string
    $positions[$i]->[0],               # starting position
    $positions[$i]->[1] - $positions[$i]->[0]); # end - start --> length
}

close $fh;

score 2 · Accepted Answer

一种使用 Perl 和 Bio::SeqIO 模块的方法。像这样运行：

./process_fasta.pl file.fa 200 300

内容./process_fasta.pl：

#!/usr/bin/perl 

use strict;
use warnings;
use Bio::SeqIO;

my $in_file = $ARGV[0];
my $start_pos = $ARGV[1];
my $end_pos = $ARGV[2];

my $in = Bio::SeqIO->new ( -file => $in_file, -format => 'fasta');
my $out = Bio::SeqIO->new( -file => ">$in_file.out", -format => 'fasta');


while (my $seq = $in->next_seq() ) {

    $seq->display_id( $seq->display_id() . "_$start_pos-$end_pos" );
    $out->write_seq( $seq->trunc($start_pos, $end_pos) );
}

结果：

>Contig[0001]_200-300
AGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAAT
TGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTT

score 1 · Accepted Answer

完整的工作轻量级脚本：

#!/usr/bin/env perl

use strict;
use warnings;

my $first=1;
if ( $ARGV[0] eq '-0' )
{
    shift @ARGV;
    $first=0;
}

die("Usage: cat <coords> | get_subseqs.pl (-0) <fasta files> > out; where coords is id, start, end. Use -0 if coords start with 0 instead of 1.\n") unless @ARGV;

# read coords
my %contigs = (); # id => [ start, end ]
while (<STDIN>) {
    chomp;
    my @row = split(/\t/);
    die("Require tab-delim: id, start, end\n") unless @row == 3;
    $contigs{$row[0]} = [ $row[1], $row[2] ];
}

# get subseqs
my ($id,$seq,$start,$end);
foreach my $infile (@ARGV) {
    open(IN, '<', $infile) or die($!);
    while (<IN>) {
        if (/>(\S+)/) {
            my $id2 = $1;
            print ">${id}_$start-$end\n", reformat(substr($seq, $start-$first, $end-$start+1)) if $id;
            if ( exists($contigs{$id2}) ) {
                ($id,$seq,$start,$end) = ($id2,'',@{$contigs{$id2}});
                delete($contigs{$id2});
            } else { $id=undef }
        } elsif($id) { $seq .= $_ }
    }
    close(IN);
    print ">${id}_$start-$end\n", reformat(substr($seq, $start-$first, $end-$start+1)) if $id;
    $id = undef;
}

sub reformat { # add newline every 60 bases
    my $seq = shift;
    my $seq2 = '';
    while ( length($seq) > 60 ) {
        $seq2 .= substr($seq,0,60)."\n";
        $seq = substr($seq,60);
    }
    $seq2 .= $seq."\n";
    return $seq2;
}

score 1 · Accepted Answer

我为那些使用这些Bio::模块的人鼓掌，因为我更喜欢他们而不是写新东西。不过，也许以下内容会有所帮助：

use strict;
use warnings;

my $end   = pop;
my $start = pop;
local $/ = '>';

while (<>) {
    chomp;
    next unless /(Contig.+)/;
    my ($header) = "$/$1_$start-$end\n";
    my $seq = ${^POSTMATCH};
    $seq =~ s/\s//g;
    print $header;
    print +( substr $seq, $start - 1, $end - $start ) . "\n";
}

用法：perl script.pl inFile start end [>outFile]

最后一个可选参数将输出定向到文件。

例子：perl script.pl data.fasta 200 300

数据集上的输出：

>Contig[0001]_200-300
AGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAATTGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTT

start 和 end 参数被popped off @ARGV，然后记录分隔符设置为“>”。当文件被读取时——一次一个 fasta 记录——头信息被捕获，将序列留在${^POSTMATCH}. 从序列中删除所有空白。最后，重新格式化的标题被print编辑，序列中的字符范围也是如此。

希望这可以帮助！

perl - 基于位置提取fasta序列

5 回答 5

Related

Reference