
I'm fairly new to Perl and I am comparing two strings of different lengths that contain DNA nucleotides. I want the script to take the shorter string and locate it in the much longer string, allowing for mismatches, and then give me the sequence it found in the longer string plus the 5 adjacent nucleotides on either side.

So for example if I have 2 strings:

#1  ATGATCCTG
#2  TCGAGTGGCCATGAACGTGCCAATTG

I want the script to take #1 and find the corresponding sequence in #2, where it is present but with 2 mismatches, and return it along with 5 nucleotides on either side.


1 Answer


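I think using an existing, well-tested module is the way to go for this kind of task. Perl is known for having many bio modules, and a quick search on CPAN turns up Bio::Grep, which may be a good help.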

Edit

Is it possible? Yes, someone has done it before, so it is possible, but I don't think it is easy to do with a simple regular expression.
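One reason a plain regex is awkward here is that it has no built-in way to count how many positions differ. As a minimal sketch, assuming "mismatch" means a substitution only (no insertions or deletions), two equal-length strings can be compared directly. The helper name `mismatches` is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: count positions where two equal-length strings differ.
sub mismatches {
    my ($x, $y) = @_;
    die "strings must have equal length\n" unless length($x) == length($y);
    # String XOR yields "\0" wherever the bytes are identical;
    # tr with /c counts every byte that is NOT "\0".
    return (($x ^ $y) =~ tr/\0//c);
}

print mismatches('ATGATCCTG', 'ATGAACGTG'), "\n";   # prints 2
```

The second argument here is the 9-base stretch of #2 that lines up with #1 in the question.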

Since I'm not a biology expert, I'll try to give a simple example.

use strict;
use warnings;
use Data::Dumper;

my $str1 = 'ATGATCCTG';
my $str2 =  'TCGAGTGGCCATGAACGTGCCAATTG';

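# Split the short sequence into single nucleotides
my @s1 = split '', $str1;

my $miss = 0;
my $pattern = '';
for my $r (@s1) {
    if ($str2 =~ /$pattern$r/) {
        # The pattern extended by this nucleotide still occurs in $str2
        $pattern .= $r;
    } else {
        # Otherwise count a mismatch and accept any nucleotide here
        $miss++;
        $pattern .= '[ATCG]';
    }
}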

# This is the pattern that was built
print Dumper $pattern;

# Without the 5 nucleotides on both sides:
# $str2 =~ m/($pattern)/;

# Match the pattern plus up to 5 nucleotides on either side
if ($str2 =~ m/(\w{0,5}$pattern\w{0,5})/) {
    # This is the match
    print Dumper $1;
} else {
    print "No match found\n";
}

# Number of mismatches
print Dumper $miss;

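Again, I'm not sure this is exactly the right way to do it, and it is definitely not the approach to follow for large DNA sequences, but for your task above I think it is fine.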
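For comparison, here is a sketch of a different technique: a sliding-window scan that checks every position of the long string and keeps the windows that differ from the short one in at most a given number of positions. It assumes substitutions only (no insertions or deletions), and the name `fuzzy_find` and its parameters are made up for illustration, not taken from any module:

```perl
use strict;
use warnings;

# Hypothetical helper: return every window of $text that matches $query
# with at most $max_miss substitutions, plus up to $flank bases on each side.
sub fuzzy_find {
    my ($query, $text, $max_miss, $flank) = @_;
    my $qlen = length $query;
    my @hits;

    for my $start (0 .. length($text) - $qlen) {
        my $miss = 0;
        for my $i (0 .. $qlen - 1) {
            next if substr($query, $i, 1) eq substr($text, $start + $i, 1);
            last if ++$miss > $max_miss;    # give up on this window early
        }
        next if $miss > $max_miss;

        # Clamp the left flank at the start of the string;
        # substr itself truncates the right flank at the end.
        my $from = $start - $flank;
        $from = 0 if $from < 0;
        my $len = ($start - $from) + $qlen + $flank;

        push @hits, {
            pos        => $start,
            mismatches => $miss,
            match      => substr($text, $start, $qlen),
            context    => substr($text, $from, $len),
        };
    }
    return @hits;
}

for my $hit (fuzzy_find('ATGATCCTG', 'TCGAGTGGCCATGAACGTGCCAATTG', 2, 5)) {
    print "$hit->{pos}\t$hit->{mismatches}\t$hit->{match}\t$hit->{context}\n";
}
```

With the two strings from the question and a limit of 2 mismatches, this reports a single hit at offset 10: `ATGAACGTG`, with context `TGGCCATGAACGTGCCAAT`. The scan does work proportional to the product of the two lengths in the worst case, so for large sequences a dedicated tool is likely still the better choice.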

answered 2013-08-08T15:59:59.387