0

每个人的另一个问题。重申一下,我对 Perl 过程非常陌生,我为犯下愚蠢的错误提前道歉

我正在尝试计算不同长度 DNA 序列的 GC 含量。该文件采用以下格式:

>gene 1
DNA sequence of specific gene
>gene 2
DNA sequence of specific gene
...etc...

这是文件的一小部分

  >env
  ATGCTTCTCATCTCAAACCCGCGCCACCTGGGGCACCCGATGAGTCCTGGGAA 

我已经建立了计数器并读取了 DNA 序列的每一行,但目前它正在对所有行的总数进行运行总和。我希望它读取每个序列,在读取序列后打印内容,然后移动到下一个。每行都有单独的碱基计数。

这就是我到目前为止所拥有的。

#!/usr/bin/perl


#necessary code to open and read a new file and create a new one.
use strict;
my $infile = "Lab1_seq.fasta";
open INFILE, $infile or die "$infile: $!";
my $outfile = "Lab1_seq_output.txt";            
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!";

#establishing the intial counts for each base
my $G = 0;
my $C = 0;
my $A = 0;
my $T = 0;

 #initial loop created to read through each line 
 while ( my $line = <INFILE> ) {
chomp $line;


# reads file until the ">" character is encounterd and prints the line
if ($line =~ /^>/){
    print OUTFILE "Gene: $line\n";

}
# otherwise count the content of the next line. 
# my percent counts seem to be incorrect due to my Total length counts skewing the following line. I am currently unsure how to fix that
elsif ($line =~ /^[A-Z]/){
    my @array = split //, $line;
    my $array= (@array);
 # reset the counts of each variable
    $G = (); 
    $C = ();
    $A = ();
    $T = ();
    foreach $array (@array){ 
 #if statements asses which base is present and makes a running total of the bases.     
    if ($array eq 'G'){
        ++$G;   
    }        
    elsif ( $array eq 'C' ) {
    ++$C; }
    elsif ( $array eq 'A' ) {
    ++$A; }
    elsif ( $array eq 'T' ) {
    ++$T; }

  } 
# all is printed to the outfile
    print OUTFILE "G:$G\n";
    print OUTFILE "C:$C\n";
    print OUTFILE "A:$A\n";
    print OUTFILE "T:$T\n";
    print OUTFILE "Total length:_", ($A+=$C+=$G+=$T), "_base pairs\n";
    print OUTFILE "GC content is(percent):_", (($G+=$C)/($A+=$C+=$G+=$T)*100),"_%\n";


}

}
#close the outfile and the infile
close OUTFILE; 
close INFILE;

我再次觉得我走在正确的道路上,我只是缺少一些基本的基础。任何帮助将不胜感激。

最后一个问题是打印出来的最终计数。我的百分比值是错误的,并且给了我错误的值。我觉得正在计算总数,然后将新值合并到总数中。

4

2 回答 2

1

几件事:1.使用哈希而不是声明每个元素。2. 分配例如$G = (0);确实有效,但它不是分配标量的正确方法。您所做的是声明一个数组,它在标量上下文$G =中返回第一个数组项。正确的方法是$G = 0

my %seen;
$seen{/^([A-Z])/}++ for (grep {/^\>/} <INFILE>);

foreach $gene (keys %seen) {
    print "$gene: $seen{$gene}\n";
}
于 2013-08-18T20:31:42.137 回答
0

只需在发现新基因时重置计数器。另外,我会使用哈希进行计数:

use strict; use warnings;

my %counts;
while (<>) {
  if (/^>/) {
    # print counts for the prev gene if there are counts:
    print_counts(\%counts) if keys %counts;
    %counts = (); # reset the counts
    print $_;     # print the Fasta header
  } else {
    chomp;
    $counts{$_}++ for split //;
  }
}
print_counts(\%counts) if keys %counts;  # print counts for last gene

sub print_counts {
  my ($counts) = @_;
  print "$_:=", ($counts->{$_} || 0), "\n" for qw/A C G T/;
}

用法:$ perl count-bases.pl input.fasta

示例输出:

> gene 1
A:=3
C:=1
G:=5
T:=5
> gene 2
A:=1
C:=5
G:=0
T:=13

风格评论:

opening 一个文件时,总是使用词法文件句柄(普通变量)。另外,你应该做一个三参数打开。我还推荐autodie用于自动错误处理的编译指示(从 perl v5.10.1 开始)。

use autodie;
open my $in,  "<", $infile;
open my $out, ">", $outfile;

请注意,我没有在上面的脚本中打开文件,因为我使用特殊的ARGV文件句柄进行输入,并打印到STDOUT. 输出可以在 shell 上重定向,比如

$ perl count-bases.pl input.fasta >counts.txt

用括号中的值声明标量变量my $G = (0)很奇怪,但工作正常。我认为这比有用更令人困惑。→ my $G = 0

你的意图有点奇怪。将右大括号与另一个语句放在同一行是非常不寻常且视觉上令人困惑的

...
elsif ( $array eq 'C' ) {
    ++$C; }

我更喜欢拥抱 elsif:

   ...
} elsif ($base eq 'C') {
    $C++;
}

该语句my $array= (@array);将数组的长度放入$array. 做什么的?提示:您可以在 foreach 循环中声明变量,例如for my $base (@array) { ... }.

于 2013-08-18T20:37:41.770 回答