
I have a 300 GB file from which I need only certain lines. Of the lines shown below, I need only those that start with >miR.

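I have written a Perl program that prints the output I want, but when I run the same code on a much larger file (up to 300 GB of lines like those shown below), the process gets killed. How should I proceed? Is there an alternative approach I could use in this code?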

#!/usr/bin/perl -w
$len=@ARGV;
if($len eq 0){
    print "Give file \n";
    exit;
}
$file=$ARGV[0];
open(FH,$file) || die "cant open file\n";
@lines=<FH>;
close FH;
while ($line=<FH>){
    chomp $line;
    if ($line =~ /^>miR/){
        $_=$line;
        s/>//g && s/,//g;
        print "$_\n";
        if($_=~ /(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)/){
            print $1,"\t",$2,"\t",$7,"\t",$3,"\n";
        }
    }
}

Forward:    Score: 124.000000  Q:2 to 18  R:1 to 20 Align Len (17) (64.71%) (82.35%)

   Query:    3' gaauAUUCGUUAG-AAUGGUAa 5'
                    |:: :|||| || |||| 
   Ref:      5' --ctTGGTTAATCATTCCCATt 3'

   Energy:  -10.480000 kCal/Mol

Scores for this hit:
>miR844a    AT2G33810,  124.00  -10.48  2 18    1 20    17  64.71%  82.35%


   Forward: Score: 120.000000  Q:2 to 19  R:289 to 308 Align Len (17) (64.71%) (76.47%)

   Query:    3' gaaUAUUCGUUAGAAUGGUAa 5'
                   ||::| ||  || |||| 
   Ref:      5' ttgATGGG-AAAATTTCCATt 3'

   Energy:  -9.850000 kCal/Mol

Scores for this hit:
>miR844a    AT2G33810,  120.00  -9.85   2 19    289 308 17  64.71%  76.47%


   Forward: Score: 118.000000  Q:2 to 19  R:483 to 503 Align Len (17) (64.71%) (82.35%)

   Query:    3' gaaUAUUCGUUAGAAUGGUAa 5'
                   :||:  |||| ||:||| 
   Ref:      5' gggGTAGAAAATCATATCATa 3'

1 Answer


We can set local $/ = '>' so that '>' acts as the input record separator, and then process the input one record at a time, as follows:

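use Modern::Perl;

{
    # Read '>'-terminated records instead of newline-terminated lines,
    # so only one record is held in memory at a time.
    local $/ = '>';
    while (<DATA>) {
        next if !/^miR/;    # skip records that do not start with "miR"
        s/,//g;             # strip commas
        # Keep whitespace-separated fields 0, 1, 2 and 6 of the score line.
        my ($var0, $var1, $var2, $var6) = (split ' ', $_, 8)[0..2, 6];
        say "$var0,\t$var1,\t$var6,\t$var2";
    }
}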


__DATA__
>miR844a    AT2G33810,  124.00  -10.48  2 18    1 20    17  64.71%  82.35%


   Forward: Score: 120.000000  Q:2 to 19  R:289 to 308 Align Len (17) (64.71%) (76.47%)

   Query:    3' gaaUAUUCGUUAGAAUGGUAa 5'
                   ||::| ||  || |||| 
   Ref:      5' ttgATGGG-AAAATTTCCATt 3'

   Energy:  -9.850000 kCal/Mol

Scores for this hit:
>moR844a    AT2G33810,  120.00  -9.85   2 19    289 308 17  64.71%  76.47%


   Forward: Score: 118.000000  Q:2 to 19  R:483 to 503 Align Len (17) (64.71%) (82.35%)

   Query:    3' gaaUAUUCGUUAGAAUGGUAa 5'
                   :||:  |||| ||:||| 
   Ref:      5' gggGTAGAAAATCATATCATa 3'
>miR844a    AT2G33810,  120.00  -9.85   2 19    289 308 17  64.71%  76.47%


   Forward: Score: 118.000000  Q:2 to 19  R:483 to 503 Align Len (17) (64.71%) (82.35%)

   Query:    3' gaaUAUUCGUUAGAAUGGUAa 5'
                   :||:  |||| ||:||| 
   Ref:      5' gggGTAGAAAATCATATCATa 3'

Output:

miR844a,    AT2G33810,  1,  124.00
miR844a,    AT2G33810,  289,    120.00

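If the current record does not begin with "miR", the loop moves on to the next record (records are delimited by ">"). Otherwise, it removes any commas and then splits the record to get the values you're after (the ones captured by your regex).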
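To run the same idea against the real file rather than the embedded __DATA__ section, something along these lines might work. This is only a minimal sketch: it assumes the file name is passed as the first command-line argument, the helper name extract_fields is hypothetical, and it prints plain tab-separated fields (as the code in the question does) instead of the comma-plus-tab format above. Because the file is read one '>'-terminated record at a time, memory use should stay small regardless of file size.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): turn one '>'-delimited
# record into a tab-separated output line, or return undef to skip it.
sub extract_fields {
    my ($record) = @_;
    return unless $record =~ /^miR/;    # keep only records starting with "miR"
    $record =~ s/,//g;                  # strip commas
    # Fields 0, 1, 2 and 6 of the score line; the 8th chunk holds the rest.
    my ($name, $target, $score, $ref_start) = (split ' ', $record, 8)[0 .. 2, 6];
    return unless defined $ref_start;   # malformed or truncated record
    return join "\t", $name, $target, $ref_start, $score;
}

# Stream the file named on the command line, one record at a time.
if (@ARGV) {
    my $file = shift @ARGV;
    open my $fh, '<', $file or die "Cannot open $file: $!\n";
    local $/ = '>';                     # '>' ends each record
    while (my $record = <$fh>) {
        chomp $record;                  # chomp removes the trailing '>' (the current $/)
        my $line = extract_fields($record);
        print "$line\n" if defined $line;
    }
    close $fh;
}
```

It could be invoked as, for example, perl extract_mir.pl hits.txt > mir_hits.tsv, where both file names are placeholders.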

Hope this helps!

Answered 2012-05-23T17:03:52.670