0

我对这段代码的总体目标是在每个 DNA 序列中找到基序(较小的序列),该序列将根据对数赔率评分矩阵报告最大对数赔率得分。我正在搜索的 .txt 文件如下所示:

>Sequence 1
TCGACGATCAGACAG
>Sequence 2
TGGGACTTGCACG

.... 等等。

我目前正在研究我的代码的最大化步骤,并且我正在努力计算我的 DNA 序列中一个基序的对数几率得分。我有创建对数赔率评分矩阵的代码 - 请参见以下内容:

#!/usr/bin/perl -w
#Above line used in Unix and Linux
#Grome Programming Assignment 2
#E-M
use strict;
use warnings;

# usage: script.pl {motif_width} {dnafile}
#USER SPECIFICATIONS
print "Please enter the filename of the fasta sequence data: ";
my $filename1 = <STDIN>;

#Remove newline from file
chomp $filename1;

#Open the file and store each dna seq in hash
my %id2seq = ();
my $id = '';
open (FILE, '<', $filename1) or die "Cannot open $filename1.",$!;
my $dna;
while (<FILE>)
{
    if($_ =~ /^>(.+)/)
    {
        $id = $1;
    }
    else
    {
        $id2seq{$id} .= $_;
    }
}
close FILE;
foreach $id (keys %id2seq)
{
    print "$id2seq{$id}\n\n";
}

#User specifies motif width
print "Please enter the motif width:\n";
my $width = <STDIN>;

#Remove newline from file
chomp $width;

#Default width is 3 (arbitrary number chosen)
if ($width eq '')
{
    $width = 3;
}
elsif ($width <=0)
{
    print "Please enter a number greater than zero:\n";
    $width = <STDIN>;
    chomp $width;
}

#User specifies number of initial random
#starting alignments
print "Please enter the number of initial random
starting alignments:\n";
my $start = <STDIN>;

#Remove newline from file
chomp $start;

#Default start is 50
if ($start eq '')
{
    $start = 50;
}
elsif ($start <=0)
{
    print "Please enter a number greater than zero:\n";
    $start = <STDIN>;
    chomp $start;
}

#User specifies number of iterations to
#perform expectation-maximization
print "Please enter the number of iterations for 
expectation-maximization:\n";
my $iteration = <STDIN>;

#Remove newline from file
chomp $iteration;

#Default iteration = 500
if($iteration eq '') 
{
    $iteration = 500;
}
elsif ($iteration <=0)
{
    print "Please enter a number greater than zero:\n";
    $iteration = <STDIN>;
    chomp $iteration;
}

#EXPECTATION
#Initialize counts for motif positions
#Incorporate pseudocounts initially
my %mot = map { $_ => [ (1) x $width ] } qw( A C G T );

# Initialize background counts
my %bg = map { $_ => 0 } qw( A C G T );

#Fill background and motif counts
foreach $id (keys %id2seq)
{
    #Generate random start site in the sequence
    #for motif to start from
    my $ms = int(rand(length($id2seq{$id})-$width));

    # Within a motif, count the bases at the positions
    for my $pos (0..length($id2seq{$id})-1)
    {
        my $base = substr($id2seq{$id}, $pos, 1);
        if ($pos >= $ms && $pos < $ms + $width) 
        {
            ++$mot{$base}[$pos-$ms]
                if exists($mot{$base});
        } 
        else 
        {
            ++$bg{$base}
            if exists($bg{$base});
        }
    }
}
#Print the background and motif counts
for my $base (qw( A C G T )) 
{
   print "$base @{$mot{$base}}\n";
}

print "\n";

for my $base (qw( A C G T )) 
{
   print "bg$base = $bg{$base}\n";
}

#Create frequency table of the motifs
#Get sum of the background
my $bgsum = 0;
for my $base (qw( A C G T))
{
    $bgsum = $bgsum + $bg{$base};
}
print "\n$bgsum\n\n";

#Create background frequency table
my %bgfreq = map { $_ => 0 } qw( A C G T );
for my $base (qw( A C G T))
{
    $bgfreq{$base} = $bg{$base} / $bgsum;
    print "bgfreq$base = $bgfreq{$base}\n";
}

#Get sum of each motif position
my @motsum = ( (0) x $width );
for my $base (qw( A C G T))
{
    for my $arrpos (0.. ($width-1))
    {
        $motsum[$arrpos] = $motsum[$arrpos] + @{$mot{$base}}[$arrpos];
    }
}

#Create motif frequency table
my %motfreq = map { $_ => [ (0) x $width ]} qw( A C G T );
for my $base (qw( A C G T))
{
    for my $arrpos (0.. ($width-1))
    {
        $motfreq{$base}[$arrpos] = $mot{$base}[$arrpos] / $motsum[$arrpos];
    }
    print "motfreq$base @{$motfreq{$base}}\n";
}

#Create odds table of motifs
my %odds = map { $_ => [ (0) x ($width) ]} qw( A C G T );

for my $base (qw( A C G T))
{
    for my $arrpos (0.. ($width-1))
    {
        $odds{$base}[$arrpos] = $motfreq{$base}[$arrpos] / $bgfreq{$base};
    }
    print "odds$base @{$odds{$base}}\n";
}

#Create log-odds table of motifs
my %logodds = map { $_ => [ (0) x ($width) ]} qw( A C G T );
for my $base (qw( A C G T))
{
    for my $arrpos (0.. ($width-1))
    {
        $logodds{$base}[$arrpos] = log2($odds{$base}[$arrpos]);
    }
    print "logodds$base @{$logodds{$base}}\n";
}
#####################################################
sub log2
{
    my $n = shift;
    return log($n)/log(2);
}

现在,我需要计算每个 DNA 序列中基序的对数优势得分。然后,我将遍历序列中所有可能的位置,并找到每个序列的最大分数。未来的工作需要我用最高分数回忆主题,但我还没有尝试过(只是想为这些最高分数提供范围)。

策略:我将创建一个对数赔率分数和最大分数的哈希,以保持每个序列的最大分数作为迭代。为了计算对数赔率分数,我将查看对数赔率评分矩阵的哪个元素与主题中的元素匹配。

#MAXIMIZATION
#Determine location for each sequence that maximally
#aligns to the motif pattern
#Calculate logodds for the motif

#Create hash of logodds scores and hash of maxscores
#so each id has a logodds score and max score
 my %loscore = map { $_ => [ (0) x (length($id2seq{$id})-$width) ]} qw( $id ); #Not sure if $id is correct, but I want a loscore for each $id
 my %maxscore = map { $_ => [ (0) x (length($id2seq{$id})-$width) ]} qw( $id ); #Not sure if $id is correct, but I want a maxscore for each $id
 foreach $id (keys %loscore, %maxscore, %id2seq)
 {
 my $len = length($id2seq{$id});
 for my $base (qw( A C G T ))
 {
     for my $pos (0..$len-1)
     {
         if ($id2seq{$id}[$pos] = $mot{$base})
         {
             for my $motpos (0..$width-1)
             {
                 $loscore{$id} = $loscore{$id} + $logodds{$base}[$motpos];
                 if ($loscore{$id} > $maxscore{$id})
                 {
                     $maxscore{$id} = $loscore{$id};
                 }
             }
         }
     }
 }
 print "$id2seq{$id} logodds score: $maxscore{$id}\n";
 }

#####################################################
sub log2
{
    my $n = shift;
    return log($n)/log(2);
}

当我从代码中取消注释最大化步骤并运行它时,最大化部分不会打印任何内容。我没有收到任何错误,但没有新的打印。我知道我的期望步骤可以简化(我会在它工作后清理所有内容),但我首先专注于这个最大化步骤。我知道最大化代码中有很多缺陷,尤其是在尝试创建对数赔率分数和最大分数哈希值时。任何和所有输入都有帮助!提前感谢您的耐心和建议。如果您需要任何澄清,请告诉我。

4

1 回答 1

0

foreach $id (keys %loscore, %maxscore, %id2seq)不会遍历所有三个哈希。你可能是说foreach $id (keys %loscore, keys %maxscore, keys %id2seq)

if ($id2seq{$id}[$pos] = $mot{$base})将值分配给$id2seq{$id}[$pos]。要检查你需要的平等if ($id2seq{$id}[$pos] eq $mot{$base})。但这也可能是错误的,因为$id2seq{$id}应该是一个字符串并且$mot{$base}是一个数组。

目前尚不清楚您的代码应该如何工作,并且很难在如此长的代码示例中找到错误。

于 2019-03-20T14:55:41.057 回答