我为以下生物信息学问题编写了一个 Perl 脚本,但不幸的是,输出存在问题。
问题
1)从40,000个唯一序列的文件中,唯一的意思是序列ID号,提取以下模式
$gpat = [G]{3,5}; $npat = [A-Z]{1,25};<br>
$pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
2)对于每个序列,查找是否$pattern
出现在的值之间
- 0-100
- 100-200
- 200-300
- ...
- 900-1000
1000
如果某个序列的长度小于 1000 个字符,则即使如此,也必须保持除法,即 0-100,100-200 等。
问题
我遇到的主要问题是计算每个序列细分出现 $pattern 的次数,然后为所有序列添加它的计数。
例如,对于序列 1,假设 $pattern 在长度 >1000 处出现 5 次。对于序列 2,假设 $pattern 在长度>1000 时出现 3 次。那么总数应该是 5+3 =8。
相反,我的结果是这样的: (5+4+3+2+1) + (3+2+1) = 21 即累计总数。
对于每个 100 个字符的前 10 个细分,我面临同样的问题。
如果可以为此计算提供正确的代码,我将不胜感激。
我写的代码如下。它很大程度上来自 Borodin 对我之前的一个问题的回答:Perl: Search a pattern across array elements
他的答案在这里:https ://stackoverflow.com/a/11206399/1468737
代码:
use strict;
use warnings;
my $gpat = '[G]{3,5}';
my $npat = '[A-Z]{1,25}';
my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
my $regex = qr/$pattern/i;
open my $fh, '<', 'small.fa' or die $!;
my ($id, $seq);
my @totals = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0); #intialize the @total arrays...
#..it should contain 10 parts for 10 divisions upto 1000bp
my @thousandcounts =(0); #counting total occurrences of $pattern at >1000 length
while (<$fh>) {
chomp;
if (/^>(\w+)/) {
process_seq($seq) if $id;
$id = $1;
$seq = '';
print "$id\n";
}
elsif ($id) {
$seq .= $_;
process_seq($seq) if eof;
}
}
print "Totals : @totals\n";
print "Thousand Counts total : @thousandcounts\n";
##**SUBROUTINE**
sub process_seq {
my $sequence = shift @_;
my $subseq = substr $sequence,0,1000;
my $length = length $subseq;
print $length,"\n";
if ($length eq 1000) {
my @offsets = map {sprintf '%.0f', $length * $_/ 10} 1..10;
print "Offsets of 10 divisions: @offsets\n";
my @counts = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
my @count = (0);
while ($sequence =~ /$regex/g) {
my $place = $-[0];
print $place,"\n\n";
if ($place <=1000){
for my $i (0..9) {
next if $place >= $offsets[$i];
$counts[$i]++;
last;
}
}
print "Counts : @counts\n\n";
$totals[$_] += $counts[$_] for 0..9;
if ($place >1000){
for my $i(0){
$count[$i]++;
last;
}
} print "Count greater than 1000 : @count\n\n";
$thousandcounts[$_] += $count[$_] for 0;
}
}
#This region of code is for those sequences whose total length is less than 1000
#It is working great ! No issues here
elsif ($length != 1000) {
my $substr = join ' ', unpack '(A100)*', $sequence;
my @offsets = map {sprintf '%.0f', $length * $_/ ($length/100)} 1..10;
print "Offsets of 10 divisions: @offsets\n";
my @counts = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0,);
while ($sequence =~ /$regex/g) {
my $place = $-[0];
print "Place : $place","\n\n";
for my $i (0..9) {
next if $place >= $offsets[$i];
$counts[$i]++; .
last;
}
}
print "Counts : @counts\n\n";
$totals[$_] += $counts[$_] for 0..9;
}
}#subroutine ends
我还附上了我正在使用的文件的一小部分。这个文件的标题是small.fa
,我一直在尝试使用这个文件,然后才转到包含超过 40,000 个序列的更大文件。
>NR_037701 1
aggagctatgaatattaatgaaagtggtcctgatgcatgcatattaaaca
tgcatcttacatatgacacatgttcaccttggggtggagacttaatattt
aaatattgcaatcaggccctatacatcaaaaggtctattcaggacatgaa
ggcactcaagtatgcaatctctgtaaacccgctagaaccagtcatggtcg
gtgggctccttaccaggagaaaattaccgaaatcactcttgtccaatcaa
agctgtagttatggctggtggagttcagttagtcagcatctggtggagct
gcaagtgttttagtattgtttatttagaggccagtgcttatttagctgct
agagaaaaggaaaacttgtggcagttagaacatagtttattcttttaagt
gtagggctgcatgacttaacccttgtttggcatggccttaggtcctgttt
gtaatttggtatcttgttgccacaaagagtgtgtttggtcagtcttatga
cctctattttgacattaatgctggttggttgtgtctaaaccataaaaggg
aggggagtataatgaggtgtgtctgacctcttgtcctgtcatggctggga
actcagtttctaaggtttttctggggtcctctttgccaagagcgtttcta
ttcagttggtggaggggacttaggattttatttttagtttgcagccaggg
tcagtacatttcagtcacccccgcccagccctcctgatcctcctgtcatt
cctcacatcctgtcattgtcagagattttacagatatagagctgaatcat
ttcctgccatctcttttaacacacaggcctcccagatctttctaacccag
gacctacttggaaaggcatgctgggtctcttccacagactttaagctctc
cctacaccagaatttaggtgagtgctttgaggacatgaagctattcctcc
caccaccagtagccttgggctggcccacgccaactgtggagctggagcgg
gagggaggagtacagacatggaattttaattctgtaatccagggcttcag
ttatgtacaacatccatgccatttgatgattccaccactccttttccatc
tcccagaagcctgctttttaatgcccgcttaatattatcagagccgagcc
tggaatcaaactgcctctttcaaaacctgccactatatcctggctttgtg
acctcagccaagttgcttgactattctcagtctcagtttctgcacctgtc
aaatagggtttatgttaacctaactttcagggctgtcaggattaaatgag
catgaaccacataaaatgtttggtgtatagtaagtgtacagtaaatactt
ccattatcagtccctgcaattctatttttcttccttctctacacagcccc
tgtctggctttaaaatgtcctgccctgctttttatgagtggataccccca
gccctatgtggattagcaagttaagtaatgacactcagagacagttccat
ctttgtccataacttgctctgtgatccagtgtgcatcactcaaacagact
atctcttttctcctacaaaacagacagctgcctctcagataatgttgggg
gcataggaggaatgggaagcccgctaagagaacagaagtcaaaaacagtt
gggttctagatgggaggaggtgtgcgtgcacatgtatgtttgtgtttcag
gtcttggaatctcagcaggtcagtcacattgcagtgtgtcgcttcacctg
gctccctcttttaaagattttccttccctctttccaactccctgggtcct
ggatcctccaacagtgtcagggttagatgccttttatgggccacttgcat
tagtgtcctgatagaggcttaatcactgctcagaaactgccttctgccca
ctggcaaagggaggcaggggaaatacatgattctaattaatggtccaggc
agagaggacactcagaatttcaggactgaagagtatacatgtgtgtgatg
gtaaatgggcaaaaatcatcccttggcttctcatgcataatgcatgggca
cacagactcaaaccctctctcacacacatacacatatacattgttattcc
acacacaaggcataatcccagtgtccagtgcacatgcatacacgcacaca
ttcccttcctaggccactgtattgctttcctagggcatcttcttataaga
caccagtcgtataaggagcccaccccactcatctgagcttatcaaccaat
tacattaggaaagactgtatttcctagtaaggtcacattcagtagtactg
agggttgggacttcaacacagctttttgggggatcataattcaacccatg
acagccactgagattattatatctccagagaataaatgtgtggagttaaa
aggaagatacatgtggtacaaggggtggtaaggcaagggtaaaaggggag
ggaggggattgaactagacacagacacatgagcaggactttggggagtgt
gttttatatctgtcagatgcctagaacagcacctgaaatatgggactcaa
tcattttagtccccttctttctataagtgtgtgtgtgcggatatgtgtgc
tagatgttcttgctgtgttaggaggtgataaacatttgtccatgttatat
aggtggaaagggtcagactactaaattgtgaagacatcatctgtctgcat
ttattgagaatgtgaatatgaaacaagctgcaagtattctataaatgttc
actgttattagatattgtatgtctttgtgtccttttattcatgaattctt
gcacattatgaagaaagagtccatgtggtcagtgtcttacccggtgtagg
gtaaatgcacctgatagcaataacttaagcacacctttataatgacccta
tatggcagatgctcctgaatgtgtgtttcgagctagaaaatccgggagtg
gccaatcggagattcgtttcttatctataatagacatctgagcccctggc
ccatcccatgaaacccaggctgtagagaggattgaggccttaagttttgg
gttaaatgacagttgccaggtgtcgctcattagggaaaggggttaagtga
aaatgctgtataaactgcatgatgtttgcaggcagttgtggttttcctgc
ccagcctgccaccaccgggccatgcggatatgttgtccagcccaacacca
caggaccatttctgtatgtaagacaattctatccagcccgccacctctgg
actccctcccctgtatgtaagccctcaataaaaccccacgtctcttttgc
tggcaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaa
>NR_002714 1
gttatacatctctaccattacctagcctgaaaagccacctcagattcagc
caacaagtaagtgggcattacaggagaagggtacctttcacaagggctgt
aatctaaaatcttggggaagatacagcgtcatctgtccaagaggtgtcag
cagtaacgaagcctcagtagaagccaaagttattttggattactgagcct
gtatagtttccagattctcaagagaaatatatgggaatgtagatatctca
gaggaccttcctgctgtcaggaattcagaggaggaaataaggaaggtaat
aggtgctctgctctcattctctcaaaccctcttccctgtgttttcctata
gagattgctgatttgctccttaagcaagagattcactgctgctcagcatg
gctcagaccaactcatgcttcatgctgatctcctgcctgatgttcctgtc
tctgagccaaggtgagattgttttccccacacatacctcccacaacccca
gccctgaagccctcactctatcctcatgcatatgagttcacttgagaaaa
agcagagtcaagttcaggggttgttttgtgttgttcagtgatatttattg
ctgatctcatcccattcaaaaacatcctgacctccctaaggagttagaga
tggaacttagcataaccctttatcagtgaccactgcagttggcattggtt
tgtcatattaacactactcatgatgggggtgttgaggatgtctgtttgta
gacagtcattagtggaatggggaactgaggggagctttgtgtgtagagaa
actggacaggcttgagaaagaagcctcagtccttcaaggaagaaaaagcc
ataagtaaaagggacaatggggacacttttcatgagcctattcattgtgt
gctcttgtcttgagcaaagacatcttgagagcctataggtaagatgcaga
agggcagaagtgaccaatcgcttcgtgacctataggatccttctattcct
ataaagaatcctcagaagctcctacctcatattttagcctttaccttgcc
ctgagggtctttcttaattgtctctcttttcccaggacaggaggcccatg
ctgagttgcccaaggcccagatcagctgcccagaaggcaccagtgcctaa
ggctcccactgctactactttaatgaagagcatgagacctgggtttatgc
agatgtgagtgaggagagcagtgtgggaagggaggctcacgaagggaggg
gaagctgccactctccagtgtgttcagtggctgatatgagatgagactaa
tcccctccctatccaatcatcagcccaaaactttccaatctactttatcc
catcattcagcacagagatgctggtggtcagtgacagcatcatcagggac
atttctgtgctgtcctttttctgttacatcctctgggagggctcaatatg
tctcccacactttcctccttcactgagtgctccattttcttctccaacag
ctctactgccagaacatgaattcaggtaacctggtgtctgtgctcaccca
ggctgagggtgcctttgtggcttcgctgattaaagagagtggcaccaagg
atagcaatgtctggattggcctccatgacccccaccggatcagtctgctg
catcttctacctcctgattatcaggttccagagggtctgatgtctggcac
ctcaagcatcagtttttactatattatgataaaagcaacctctctataaa
tcatataatgtaaaggatatcaaggttctccataggttcttcgagataag
cttaaagctgaatttcctgtgtgtttcaggcattcacagataaactcatt
ctctgtacttctagggtagcatctttatgtatctattatgtacctcttat
ctattgtgttatcatctctgttatagaagagccttctgtagaccatatag
aaaaagattatagaggaggagaatctactgctggcaattgggaaccgcaa
ggtatactaaataatatatcaacaactaatggccatctaatgctatgctg
gatatgaacttttggggcctcaggaaagaaaaaccaggaactagtttcaa
taatgaggtgtcatggttccctgtggcaaatttagaacgcttatcgtttg
gcaggacacagagaggtaggtgaacattccaggaaagaagcagcttagag
aaaatgtggaggaaataatatgacacttagagaaaaaggaaggtttattc
ttgtcttatgtcttgacctgtttctgagtgcgaacacaaaccaggtgttt
ctgtctctttctgagtcacgtctgcccctgttctggcccttccccatcta
gaactgccattatcagtggagtagtgggtccctggtctcctacaaatcct
gggacattggatccccaagctgtgccaatactgcctactgtgctagcctg
acttcaagctcaggtgaggggcacagaatccacacacttattgccatcct
ctcctatttatctctgaggatcgaccggggactgggatagaggaagggtg
agctcctcattcaggaaatagaggagtgtttcctctttatttttgctgag
tcctgcagccaggagggtaatacactctgatcccctcagtctgaatcttc
tcattgtcttataggattcaagaaatggaaggatgattcttgtaaggaga
agttctcctttgtttgcaagttcaaatactggaggcaattgtaaaatgga
cgtctagaattggtctaccagttactatggagtaaaagaattaaactgga
ccatctctctccatatcaatctggaccatctctcctctgctaaatttgca
tgactgatctttagtatctttacctacctcaatttctggagccctaaaca
ataaaaataaacatgtttcccccat
>NR_003569 1
ctgggacccacgacgacagaaggcgccgatggccgcgcctgctgagccct
gcgcggggcagggggtctggaaccagacagagcctgaacctgccgccacc
agcctgctgagcctgtgcttcctgagaacagcaggggtctgggtaccccc
catgtacctctgggtccttggtcccatctacctcctcttcatccaccacc
atggccggggctacctccggatgttccccactcttcaaagccaagatggt
gcttggattcgccctcatagtcctgtgtacctccagcgtggctgtcgctc
tttggaaaatccaacagggaacgcctgaggccccagaattcctcattcat
cctactgtgtggctcaccacgatgagcttcgcagtgttcctgattcacac
caagaggaaaaagggagtccagtcatctggagtgctgtttggttactggc
ttctctgctttgtcttgccagctaccaacgctgcccagcaggcctccgga
gcgggcttccagagcgaccctgtccgccacctgtccacctacctatgcct
gtctctggtggtggcacagtttgtgctgtcctgcctggcggatcaacccc
ccttcttccctgaagacccccagcagtctaacccctgtccagagactggg
gcagccttcccctccaaagccacgttctggtgggtttctggcctggtctg
gaggggatacaggaggccactgagaccaaaagacctctggtcgcttggga
gagaaaactcctcagaagaacttgtttcccggcttgaaaaggagtggatg
aggaaccgcagtgcagcccgggggcacaacaaggcaatagcatttaaaag
gaaaggcggcagtggcatggaggctccagagactgagcccttcctacggc
aagaagggagccagtggcgcccactgctgaaggccatctggcaggtgttc
cattctaccttcctcctggggaccctcagcctcgtcatcagtgatgtctt
caggttcactgtccccaagctgctcagccttttcctggagtttattggtg
atcccaagcctccagcctggaagggctacctcctcgccgtgctgatgttc
ctctcggcctgcctgcaaacgctgtttgagcagcagaacatgtacaggct
caaggtgctgtagatgaggctgcggtcggccatcactggcctggtgtaca
gaaaggcatccacagcatatctgaagaaatattcagaagttaactaatct
cagatgatttcagcaggagtaaagaagagaaacagactcagaaatgccat
tacaacagttaattatgtcaaatttatcaccctgattgatcacgcagcat
taacctcaagaacgccaagccaagtttttttgacaaatgtgagccaaggt
ttccgaaaaactagcagatatgactgtgacttacaaaatggaaaaagtaa
acgagaaacacaatttgatatgatttaataaaagatttgtttccaccact
tctcctgggaacctcagcacattttctttccactgacagttattatctct
acctttattgaacaaagacacccggaacacagctgctgaggatcagtaaa
gaaaatcattcttttattaataagactgttattagcaggaaaaaaaaatc
catgtttgggagtttgcactgaagttacaggccattttgaagaaatatgg
ctgactagtgccaacattatttcaggcaatttcatgatcaaatgtcttat
taggttgtttaaaatttttatagagattgtaaatcagaactattttctat
ttgccctaaatatttagatgctacagggaaagcagatcaaattaaagggt
actgtgcacatttttttactgggaactcccagggatataaatcatttcgc
ctgcagcatggaattcttcagtacacatgcttgtggaaacattccacgct
ccgccagcacgctcattaaagtgatgatttgggttgcaacaacagtgcca
agtacttcctgtgttcaactggggaccatgtggcaagacccaaagcttcc
ccagagatcctatgggaataagttttttgagccaccatattccattattt
cagcctaaaataacaccatgggacaagaatcagaagacagaggagcagac
aaatgtgtgtagacatgctggaaggaatctttctttttagaaacagggtc
aatatctattaaactttaagatgtgtatctcttgacctggcagtttctgt
atttgagttttaacctactgatatacccatgcatgtgaataaagtatctt
cctgcatgtaacaggatatttaatgtaaccttgattatagttgcaaatgc
tgggaaacgatccaaatgtctttcaatatggcactgattaaataaattat
ggcacagtctcacaatgaaaaacaaatgtagccattaaacagaatgaaat
gggtctagctaaattgaaataggactacctctaagatatgttgttaaaaa
gaaaaaaaagaaagtgcagaggaacaagtatgataccattttgtattttt
taacatatgcaagcgtgattgtgcccacacagaatacctttgaaaataaa
ctcagtatttgcctcagtggataaaaacaagaaccagccttattttcact
gttatatcttttggtgccactttttgaactttttaccatatgtgcatatg
taactttctaaataaattttgtaaaaaaaaaaaaaaaaaa
>NR_002817 2
aactcggtctccactgcactgctggccagacgagggatgttattttgggc
agtgcatctggacttggttcaagtggcaccagccaaatccctgccttact
gacctctcccctggaggagcaggagcagtgctcaaggccgccctgggagg
gctgagaggcaggctctggactggggacacagggatagctgagccccagc
tgggggtggaagctgagccagggacagtcacagaggaacaagatcaagat
gcgctttaactgagaagcccccaaggcagaggctgagaatcagaagacat
ttcagcagacatctacaaatctgaaggacaaaacatggttcaagcatctg
ggcacaggcggtccacccgtggctccaaaatggtctcctggtccgtgata
gcaaagatccaggaaatatggtgcgaggaagatgagaggaagatggcgcg
agagttcctggccgagttcatgagcacatatgtcatgatggagtggctga
ccgggatgctccagctgtgtctcttcgccatcgtggaccaggagaacaac
ccagcactgccaggaacacacgcactggtgataggcatcctcgtggtcat
catcagggtgtaccatggcatgaacacaggatatgccatcaatccgtccc
gggacctgccccccccccccgcatcttcaccttcattgctggttggggca
aactggtcttcaggtactgcccctgcccaggcccattcctttgagatttt
ctgtggggcccctgtgtgttgaggtgtggggggtgatgtgaggggcagca
caggagggtcctgcagagcccccaggtggcctggggagcaggagtgagtc
ccaacatttccccaggccagtagagatacagatcctgcacctgcactgag
tgtcaaccctgtccctgagtcgggctgaggctgaccagggccccgggttg
ggggtgtttcctgggttagcctgaggatgactcctctgctcaaccagtct
tggcccgaggtggatgagggtgctgtcctgggcatcagccccctcagccg
gcctctgcctcttgcctgcagcgatggggagaacttgtggtgggtgccag
tggtggcaccacttctgggtgcctctctaggtggcatcatctacctggtc
ttcattggctccaccatcccacgggagcccctgaaattggaggactctgt
ggcatatgaagaccacgggataaccgtattgcccaagatgggatctcatg
aacccatgatctctccccttaccctcatctccgtgagccctgccaacaga
tcttcagtccaccctgccccacccttacatgaatccatggccctagagca
cttctaagcagagattatttgtgatcccatcccttccccaataaagagaa
gcttgtcccacagcagtacccccacttcctgggggcctcctgtggttggg
cttccctcctgggttcttccaggagctctagggctatgtcttagcccaag
gtgtagaggtgaggcacctcaagtctttcatgccctgggaactggggtgc
cccagggggagaatggggaagagctgacctgcgccctcagtaggaacaag
gtaagatgaaagaatgacagaaacagaatgagggattttcaggcaagggg
gaaggaagggcagttttggtgaaaggactgtagctgactggtggggggct
ggctttggaaatactttgaggggatcctgagactggactctagactctcc
cctggttgttcccttccccgagttctggccggttcttggaccagacaagg
catggcccaagaaggtagatcagaattttttagcctttttttcattagtg
ccttccctagtataattccagattttttttcttaatcacatgaaatttta
ataccacagatatactatacatctgtttatgttctgtatatgttctgtgc
tttatacgtaaaaaagagtaagattttttttcacctccccttttaagaat
cagttttaattcccttgagaatgcttgttatagattgaaggctggtaagg
ggttgggctcctctttcttcttcctggtgccagagtgctcccacatgaag
gaataggaaaggaagatgcaaagagggaaatccttcgaacacatgaagac
acaggaagaggcctcttagggctccaagggctccagggaagcagctgcag
aggttgggtggggtgaggggccaggatccactgaccctggggccaggcag
gaatcactctgttgcctggggctcagaaggcagtatcacccatggttcct
gtcattgctcatgtattttgcctttcaacaattattgtgcacctactgtg
tgcaggccctgcctggacactggggatgcgcagtggatgcactgggctct
gcctttgagggttgcagtttaatgggtgacaggtaattataaggaagaag
gtgagtgcagagtgggaggcttggaggctgtggggcttggggtgggggag
ctcacatccagcctctgggccaaggccaggaggcttcccagagcaggaga
cagagcagggtattgtggtggggggtgtcctttttggggctgggatctgc
actttacagtttgaggggatgggcagaggaggctgggcttcattctggag
gtggggacatggtgaggtgaggtttagaaagcacacctgagccgcagtgt
gtaggatgctggaaatggtggagatgggcctgcgaagagagtgctgggaa
gtgatgacccaggagcagcagccgggcacctaacaatgggtcagcaccgt
gggcgtggagacaaaggccgggattgatcaatacccgagaagtacaatgt
acaggacttgggctccatttggatggagtgggtgagggaggagtcagaaa
tggcttccgatttccagcttgggcctggggattggagatgtccccactga
gagtagggcacaagtgaggaaatggtttggagaggaagatgataagttac
atcatggatgtgctgagtctgagttgcctatgggacttggaatggggggt
ggcaaaaggtgtgtgatcttgagcaagatattcaactcttctgggccttg
gtcttctcatttgtaaaacggtgataagaatattacttcccatttgtgtt
gctgtgaatattaaatgcgctaccacatgt
感谢您花时间解决我的问题。
任何帮助和输入将不胜感激。
感谢您花时间解决我的问题!