perl - Jumping through blocks of lines in a file based on a condition

Question

I have a .bedGraph file that looks like this:

chr     start   end     score
chr1    3000305 3000306 0.006
chr1    3000306 3000307 0.01
chr1    3000307 3000308 0.014
chr1    3000308 3000309 0.017
chr1    3000309 3000310 0.019
chr1    3000310 3000313 0.021
chr1    3000313 3000314 0.029
chr1    3000314 3000315 0.027
chr1    3000315 3000316 0.02
chr1    3000316 3000317 0.011

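I have to write a script that walks through this file and looks for a score > 0.02, grabs the start value for that score, and then keeps searching until it reaches a score < 0.02, at which point it should grab the previous end position. So in this case the program should walk from the beginning of the file, identify the first score > 0.02, grab its start position = 3000310, then keep searching until the score drops below 0.02, at which point it should grab the previous end position = 3000316. After that it should continue searching the file for such blocks and grab the start and end positions of the blocks containing scores > 0.02. Again, it should not grab every start and end within a block with score > 0.02, only the first start and the last end of each such block.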

I have written part of the code but don't know how to proceed:

open BEDGRAPH, $ARGV[0] or die print $!;

my $thresh=0.5;
my $j=1;
my $i=1;
my @arr = <BEDGRAPH>;
my @tmp;
for $i (0 .. $#arr)
{
my ($chr, $start, $end, $score) = split('\s',$arr[$i]);
if($score>=$thresh)
{
    push(@tmp,$chr);
    push(@tmp,$start);
    $j=$i+1;
    my ($chr1, $start1, $end1, $score1) = split('\s',$arr[$j]);
    while($score1>=$thresh)
    {
        $j=$j+1;
    }
    my ($chr2, $start2, $end2, $score2) = split('\s',$arr[$j-1]);
    push(@tmp,$end2);
    $i=$j+1;
    print @tmp;
}
elsif($score>=$thresh)
{
        $i=$i+1;
}
}

close(BEDGRAPH);

Here I am trying to push the required start and end positions into @tmp and print it.

score 3 · Accepted Answer

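A few suggestions to get you started.

First, why iterate over your file twice? You do it once when you read it into an array, and then again when you process the array. Why not just process the file as you read it, line by line?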

# Use a lexical filehandle and test `open` for failure
my $file = $ARGV[0];
open my $fh, "<", $file or die "Failed to open file '$file': $!";

while (<$fh>) {
    my ($chr, $start, $end, $score) = split;
...

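Note that instead of accessing the columns by array index, I used meaningful variable names.

Also, avoid magic numbers like the plague and put your threshold in a variable. That way, if it does change from 0.02 to 0.5, you only have to update it in one place in your code. Variable names also tend to be more meaningful than magic numbers to people reading your code.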

my $threshold = 0.02;

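As you read through the file, you need to keep track of a few pieces of information.

Are you inside a block (i.e. a section where the score is above your threshold)?
If you are inside a block, what was the value of start at the beginning of the block?
If you are inside a block, what was the value of end on the previous line? You need this because you don't find out that you've left a block until the next line.

If you think about how to get this information, you should be able to figure out the rest.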
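To make the three pieces of state above concrete, here is a minimal sketch. The helper name find_blocks and the in-memory sample are assumptions for illustration, not part of the original question. It treats a score equal to the threshold as still inside the block, since that is what yields the expected end position of 3000316, and it flushes a block that is still open when the file ends.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical helper: returns one [start, end] pair per block of
# consecutive lines whose score is at or above $threshold.
sub find_blocks {
    my ($fh, $threshold) = @_;
    my @blocks;
    my ($in_block, $block_start, $prev_end) = (0);

    while (my $line = <$fh>) {
        my ($chr, $start, $end, $score) = split ' ', $line;
        next unless looks_like_number($score);    # skips the header and blank lines

        if ($score >= $threshold) {
            # Entering a block: remember where it started
            ($in_block, $block_start) = (1, $start) unless $in_block;
            $prev_end = $end;                     # last end seen inside the block
        }
        elsif ($in_block) {
            # First line below the threshold: the block ended on the previous line
            push @blocks, [$block_start, $prev_end];
            $in_block = 0;
        }
    }
    # A block that runs to the end of the file is never closed by a later line
    push @blocks, [$block_start, $prev_end] if $in_block;
    return @blocks;
}

my $sample = <<'END';
chr     start   end     score
chr1    3000308 3000309 0.017
chr1    3000309 3000310 0.019
chr1    3000310 3000313 0.021
chr1    3000313 3000314 0.029
chr1    3000314 3000315 0.027
chr1    3000315 3000316 0.02
chr1    3000316 3000317 0.011
END

# Read from an in-memory string so the sketch runs without a data file
open my $fh, '<', \$sample or die "Failed to open in-memory sample: $!";
print "@$_\n" for find_blocks($fh, 0.02);
close $fh;
```

For a real file you would pass a handle opened on $ARGV[0] instead of the in-memory string.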

Edit: You completely changed your code with the latest edit. That really should be a new question.

Immediate problems with the updated code:

open BEDGRAPH, $ARGV[0] or die print $!;

Use a lexical filehandle (open my $fh) instead of a typeglob (open FILE), which is global in scope.
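As a small illustration of that advice, the sketch below uses the three-argument form of open with a lexical handle and puts $! directly into the die message. Note that die print $! prints the error and then dies with the return value of print rather than a useful message. The file name is a hypothetical one that is assumed not to exist, and the eval is there only so the demo can show the message instead of exiting.

```perl
use strict;
use warnings;

my $file = 'no_such_file.bedGraph';    # hypothetical name, assumed not to exist

my $ok = eval {
    # Lexical handle, explicit read mode, and the OS error in the message
    open my $fh, '<', $file or die "Failed to open file '$file': $!\n";
    close $fh;
    1;
};
print "Error: $@" unless $ok;
```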

my @arr = <BEDGRAPH>;

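You mentioned in the comments that you are working with very large files, yet you are using @array = <$fh>, which reads the whole file into memory. You really should be using while (<$fh>) ...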

while($score1>=$thresh)
{
    $j=$j+1;
}

Finally, you never change the value of $score1 or $thresh in the loop body, so the loop will run forever.
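If you do want to keep the array-based approach, one way to make the inner loop terminate is to re-read the score on every pass and stop at the end of the array. The sketch below is only an illustration: score_of is a hypothetical helper and the array is filled with a few sample lines. It also uses a while loop for the outer index, because assigning to the loop variable inside for $i (0 .. $#arr) does not skip any iterations.

```perl
use strict;
use warnings;

my $thresh = 0.02;
my @arr = (
    "chr1\t3000309\t3000310\t0.019\n",
    "chr1\t3000310\t3000313\t0.021\n",
    "chr1\t3000313\t3000314\t0.029\n",
    "chr1\t3000314\t3000315\t0.027\n",
    "chr1\t3000315\t3000316\t0.02\n",
    "chr1\t3000316\t3000317\t0.011\n",
);

# Hypothetical helper: extract the score column from one line
sub score_of { return (split ' ', $_[0])[3] }

my @found;
my $i = 0;
while ($i <= $#arr) {
    if (score_of($arr[$i]) >= $thresh) {
        my $j = $i;
        # The condition looks at a new line each time and checks the array
        # bound, so it can actually become false
        $j++ while $j < $#arr && score_of($arr[$j + 1]) >= $thresh;

        my ($chr, $start) = split ' ', $arr[$i];    # first line of the block
        my $end = (split ' ', $arr[$j])[2];         # last line of the block
        push @found, "$chr $start $end";
        $i = $j + 1;                                # continue after the block
    }
    else {
        $i++;
    }
}
print "$_\n" for @found;
```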

score 2 · Accepted Answer

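I think you need to make the changes to your program that ThisSuitisBlackNot described. I thought I would post a possible solution.

Update: if the chr name can change, this program may not work and would need to be adjusted.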

#!/usr/bin/perl
use strict;
use warnings;

my ($prev_chr, $prev_start, $prev_end);
my $thresh = .02;

while (<DATA>) {
    my ($chr, $start, $end, $score) = split;

    if ($score >= $thresh) {
        $prev_chr   //= $chr;
        $prev_start //= $start;
        $prev_end = $end;
    }
    else {
        if ($prev_chr) {
            print "$prev_chr $prev_start $prev_end\n";
            ($prev_chr, $prev_start, $prev_end) = (undef) x 3;
        }
    }
}
print "$prev_chr $prev_start $prev_end\n" if $prev_chr;

__DATA__
chr1    3000305 3000306 0.006
chr1    3000306 3000307 0.01
chr1    3000307 3000308 0.014
chr1    3000308 3000309 0.017
chr1    3000309 3000310 0.019
chr1    3000310 3000313 0.021
chr1    3000313 3000314 0.029
chr1    3000314 3000315 0.027
chr1    3000315 3000316 0.02
chr1    3000316 3000317 0.011
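Regarding the update note about changing chr names, one possible adjustment is to close the open block whenever the chromosome changes as well as when the score drops. This is only a sketch under assumptions: the helper name blocks_per_chr is made up, the chr2 rows in the sample are invented for illustration, and the input is assumed to have no header line (as in the __DATA__ section above).

```perl
use strict;
use warnings;

my $thresh = 0.02;

# Hypothetical helper: returns "chr start end" strings, one per block,
# never letting a block span two chromosomes.
sub blocks_per_chr {
    my ($fh) = @_;
    my @out;
    my ($prev_chr, $prev_start, $prev_end);

    while (my $line = <$fh>) {
        my ($chr, $start, $end, $score) = split ' ', $line;
        next unless defined $score;    # skip blank or short lines

        # Close the open block if the score drops or the chromosome changes
        if (defined $prev_chr && ($score < $thresh || $chr ne $prev_chr)) {
            push @out, "$prev_chr $prev_start $prev_end";
            ($prev_chr, $prev_start, $prev_end) = (undef) x 3;
        }
        if ($score >= $thresh) {
            $prev_chr   //= $chr;
            $prev_start //= $start;
            $prev_end = $end;
        }
    }
    push @out, "$prev_chr $prev_start $prev_end" if defined $prev_chr;
    return @out;
}

# Invented sample: a block at the end of chr1 followed directly by one on chr2
my $sample = <<'END';
chr1    3000313 3000314 0.029
chr1    3000314 3000315 0.027
chr2    100     101     0.03
chr2    101     102     0.01
END

open my $fh, '<', \$sample or die "Failed to open in-memory sample: $!";
print "$_\n" for blocks_per_chr($fh);
close $fh;
```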

score 0 · Accepted Answer

#!/usr/bin/perl
use warnings;
use strict;

open my $fh, '<', 'bedgraph.txt' or die "cant open bedgraph.txt $!";

my $thresh = 0.02;

my @start_pos;
my @end_pos;
my $previous_end;

my $header = <$fh>;

ABOVE:
while (<$fh>){
    my ($chr, $start, $end, $score) = split;
    if ($score > $thresh){
        push @start_pos, $start;
        $previous_end = $end;
        BELOW:
        while (<$fh>){
            my ($chr, $start, $end, $score) = split;
            if ($score < $thresh){
                push @end_pos, $previous_end;
                next ABOVE;
            }
            $previous_end = $end;
        }
        # Reached end of file while still inside a block
        push @end_pos, $previous_end;
    }
}    

close $fh;

print "Start positions found: @start_pos\n";
print "End positions found: @end_pos\n";

#Start positions found: 3000310
#End positions found: 3000316

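Brief explanation:

ABOVE loop: scans the file looking for a score greater than the threshold. When/if one is found, it stores the value in the start column of the current line in the array @start_pos. It then passes control to the loop labelled BELOW. Suppose that value is found on line 6.
BELOW loop: starts scanning the file at line 7, looking for a value below the threshold. When/if one is found, it stores the value in the end column of the previous line in the array @end_pos. It then passes control back to the loop labelled ABOVE. Suppose that value is found on line 10.
ABOVE loop: starts scanning the file at line 11, and the process repeats until every line of the file has been read.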

score 0 · Accepted Answer

Consider the following:

use strict;
use warnings;

my $startFound = 0;
my $priorEnd;

while (<>) {
    $. > 1 or next;    # Get past the header
    my ( undef, $start, $end, $score ) = split;

    if ( $score > .02 and !$startFound ) {
        $startFound = 1;
        print "Start: $start\n";
    }

    if ( $score < .02 and $startFound ) {
        $startFound = 0;
        print "End  : $priorEnd\n";
    }

    $priorEnd = $end;
}

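Usage: perl script.pl inFile [>outFile]

The last, optional parameter directs output to a file.

This script sets a flag ($startFound) to mark the start of a block, then checks for a score below 0.02 together with that flag to find the end of the block. The variable $priorEnd just holds the end value of the previous line, which is used for the start/end pair.

Hope this helps!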
