
I have two files:

regions.txt: the first column is the chromosome name, and the second and third columns are the start and end positions.

1  100  200
1  400  600
2  600  700

coverage.txt: the first column is the chromosome name, the second and third columns are the start and end positions, and the last column is a score.

1 100 101  5
1 101 102  7 
1 103 105  8
2 600 601  10
2 601 602  15

This file is very large: about 15 GB, roughly 300 million lines.

Basically, I want the mean of the coverage.txt scores that fall inside each region of regions.txt.

That is, starting from the first line of regions.txt: whenever a line of coverage.txt has the same chromosome, start-coverage >= start-region, and end-coverage <= end-region, save its score into a new array. Once the search through all of coverage.txt is finished, print the region's chromosome, start, end, and the mean of all the scores that were found.

Expected output:

1  100 200 6.7    which is (5+7+8)/3
1  400 600 0      no match in coverage.txt
2  600 700 12.5   which is (10+15)/2

I wrote the MATLAB script below, but it takes a very long time because it has to sweep through coverage.txt once per region. I don't know how to write a fast awk-like script for this.

My MATLAB script:

fc = fopen('coverage.txt', 'r');
ft = fopen('regions.txt', 'r');
fw = fopen('out.txt', 'w');

while feof(ft) == 0

    % read one region
    linet = fgetl(ft);
    scant = textscan(linet, '%d%d%d');
    tchr = scant{1};
    tx = scant{2};
    ty = scant{3};
    coverages = [];

    % rescan the whole coverage file for this region (the slow part)
    frewind(fc);
    while feof(fc) == 0

        linec = fgetl(fc);
        scanc = textscan(linec, '%d%d%d%d');
        cchr = scanc{1};
        cx = scanc{2};
        cy = scanc{3};
        cov = scanc{4};

        % keep the score if the coverage interval lies inside the region
        if (cchr == tchr) && (cx >= tx) && (cy <= ty)
            coverages = cat(2, coverages, cov);
        end

    end

    if isempty(coverages)
        covavg = 0;                           % no matching coverage line
    else
        covavg = mean(double(coverages));     % the mean, as asked for (not the median)
    end
    fprintf(fw, '%d\t%d\t%d\t%.1f\n', tchr, tx, ty, covavg);

end

fclose(fc);
fclose(ft);
fclose(fw);

I would be glad if someone could show me how to get rid of all the loops in the MATLAB script; an alternative in AWK, Perl, or ... would be just as welcome.

Thanks


4 Answers


Here is a Perl solution. I use a hash (a.k.a. dictionary) to access the ranges by chromosome, which cuts down the number of loop iterations.

It should be reasonably efficient because I do not do a full pass over regions.txt for every input line. Efficiency could probably be improved further with multithreading.

#!/usr/bin/perl
use strict;
use warnings;

my ($rangefile) = @ARGV;
open my $rFH, '<', $rangefile    or die "Can't open $rangefile";

# construct the ranges. The chromosome is used as range key.
my %ranges;
while (<$rFH>) {
    chomp;
    my @field = split /\s+/;
    push @{$ranges{$field[0]}}, [@field[1,2], 0, 0];
}
close $rFH;

# iterate over all the input
while (my $line = <STDIN>) {
    chomp $line;
    my ($chrom, $lower, $upper, $value) = split /\s+/, $line;
    # only loop over ranges with matching chromosome
    foreach my $range (@{$ranges{$chrom}}) {
        if ($$range[0] <= $lower and $upper <= $$range[1]) {
            $$range[2]++;
            $$range[3] += $value;
            last; # break out of foreach early because ranges don't overlap
        }
    }
}

# create the report
foreach my $chrom (sort {$a <=> $b} keys %ranges) {
    foreach my $range (@{$ranges{$chrom}}) {
        my $value = $$range[2] ? $$range[3]/$$range[2] : 0;
        printf "%d %d %d %.1f\n", $chrom, @$range[0,1], $value;
    }
}

Example invocation:

$ perl script.pl regions.txt <coverage.txt >output.txt

Output for the sample input:

1 100 200 6.7
1 400 600 0.0
2 600 700 12.5

(because (5+7+8)/3 = 6.66…, which rounds to 6.7)

Answered 2012-10-13T12:43:15.543

Normally I would load both files into R and do the calculation there, but since one of them is so huge, that becomes a problem. Here are some thoughts that might help you solve it.

  1. Consider splitting coverage.txt by chromosome. This would make the calculations less demanding.

  2. Instead of looping over coverage.txt repeatedly, first read regions.txt fully into memory (I assume it is much smaller). For each region, keep a running score and a count.

  3. Process coverage.txt line by line. For each line, determine the chromosome and the region that this particular stretch belongs to. This requires some footwork, but if regions.txt is not too large it will be far more efficient. Add the line's score to that region's score and increment the region's count by one. (A sketch of this scheme follows the list.)
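
A minimal MATLAB sketch of ideas 2 and 3, in MATLAB only because that is what the question uses (the Perl answer above follows the same plan). The file names and the tab-separated output come from the question; the rest is assumed, in particular that regions.txt fits comfortably in memory and that regions do not overlap:

% load all regions into memory
frid = fopen('regions.txt', 'r');
R = textscan(frid, '%s %f %f');
fclose(frid);
rchr = R{1};  rlo = R{2};  rhi = R{3};
rsum = zeros(size(rlo));
rcnt = zeros(size(rlo));

% group region row numbers by chromosome, so each coverage line is only
% compared against the regions of its own chromosome
byChr = containers.Map();
for k = 1:numel(rchr)
    if isKey(byChr, rchr{k})
        byChr(rchr{k}) = [byChr(rchr{k}) k];
    else
        byChr(rchr{k}) = k;
    end
end

% single pass over coverage.txt
fcid = fopen('coverage.txt', 'r');
while ~feof(fcid)
    linec = fgetl(fcid);
    if ~ischar(linec) || isempty(linec), continue; end
    parts = textscan(linec, '%s %f %f %f');
    chr = parts{1}{1};  clo = parts{2};  chi = parts{3};  score = parts{4};
    if ~isKey(byChr, chr), continue; end
    for k = byChr(chr)
        if clo >= rlo(k) && chi <= rhi(k)
            rsum(k) = rsum(k) + score;
            rcnt(k) = rcnt(k) + 1;
            break;                      % regions do not overlap
        end
    end
end
fclose(fcid);

% report: chromosome, start, end, mean score (0 if nothing matched)
fwid = fopen('out.txt', 'w');
for k = 1:numel(rchr)
    avg = 0;
    if rcnt(k) > 0, avg = rsum(k) / rcnt(k); end
    fprintf(fwid, '%s\t%d\t%d\t%.1f\n', rchr{k}, rlo(k), rhi(k), avg);
end
fclose(fwid);

Reading 300 million lines with fgetl is still slow in MATLAB, so in practice you would read coverage.txt in blocks (as the histc-based MATLAB answer further down suggests), but the per-region bookkeeping stays the same.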

A more efficient alternative requires both files to be sorted first by chromosome, then by position.

  1. Take a line from regions.txt and record its chromosome and positions. If there is a coverage line left over from the previous iteration, go to 3; otherwise go to 2.

  2. Take a line from coverage.txt.

  3. Check whether it is within the current region.

    • yes: add the score to the region's score and increment its count. Go to 2.
    • no: divide the score by the count, write the current region to the output, and go to 1.

This last method requires some fine-tuning, but it is the most efficient: it goes through each file only once and keeps almost nothing in memory. A rough sketch is given below.
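
A rough MATLAB sketch of this merge, with the fine-tuning kept minimal. Besides the file names from the question, it assumes numeric chromosome names, both files sorted by chromosome and then by start position, non-overlapping regions, and coverage intervals short enough (as in the sample data) never to belong to a region other than the one they start in:

fr = fopen('regions.txt', 'r');
fc = fopen('coverage.txt', 'r');
fw = fopen('out.txt', 'w');

cline = fgetl(fc);                           % read ahead one coverage line
rline = fgetl(fr);
while ischar(rline)
    r = sscanf(rline, '%d %d %d');           % [chrom start end]
    s = 0;  n = 0;
    while ischar(cline)
        c = sscanf(cline, '%d %d %d %d');    % [chrom start end score]
        if c(1) < r(1) || (c(1) == r(1) && c(2) < r(2))
            cline = fgetl(fc);               % lies before this region: discard
        elseif c(1) == r(1) && c(2) < r(3)
            if c(2) >= r(2) && c(3) <= r(3)  % fully contained in the region
                s = s + c(4);  n = n + 1;
            end
            cline = fgetl(fc);
        else
            break;                           % belongs to a later region: keep it
        end
    end
    avg = 0;  if n > 0, avg = s / n; end
    fprintf(fw, '%d\t%d\t%d\t%.1f\n', r(1), r(2), r(3), avg);
    rline = fgetl(fr);
end

fclose(fr);  fclose(fc);  fclose(fw);

Each line of both files is read exactly once, and nothing is kept in memory beyond the running sum and count for the current region.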

Answered 2012-10-13T11:47:42.003

Here is one way using join and awk. Run it like this:

join regions.txt coverage.txt | awk -f script.awk - regions.txt
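
Here join pairs every region with every coverage line of the same chromosome (both files must already be sorted on that first column), so with the sample data the awk script is fed lines of the form region chromosome, region start, region end, coverage start, coverage end, score:

1 100 200 100 101 5
1 100 200 101 102 7
1 100 200 103 105 8
1 400 600 100 101 5
1 400 600 101 102 7
1 400 600 103 105 8
2 600 700 600 601 10
2 600 700 601 602 15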

Contents of script.awk:

# first input (the joined stream, passed as "-"): accumulate the score sum
# and line count for every region that fully contains the coverage interval
FNR==NR {
    if ($4>=$2 && $5<=$3) {
        sum[$1 FS $2 FS $3]+=$6
        cnt[$1 FS $2 FS $3]++
    }
    next
}

# second input (regions.txt): print each region with its mean score,
# or 0 if no coverage line matched it
{
    if ($1 FS $2 FS $3 in sum) {
        printf "%s  %.1f\n", $0, sum[$1 FS $2 FS $3]/cnt[$1 FS $2 FS $3]
    }
    else if (NF == 3) {
        print $0 "  0"
    }
}

Results:

1  100  200  6.7
1  400  600  0
2  600  700  12.5

Alternatively, here it is as a one-liner:

join regions.txt coverage.txt | awk 'FNR==NR { if ($4>=$2 && $5<=$3) { sum[$1 FS $2 FS $3]+=$6; cnt[$1 FS $2 FS $3]++ } next } { if ($1 FS $2 FS $3 in sum) printf "%s  %.1f\n", $0, sum[$1 FS $2 FS $3]/cnt[$1 FS $2 FS $3]; else if (NF == 3) print $0 "  0" }' - regions.txt
Answered 2012-10-13T14:25:53.593

Here is a simple MATLAB approach that bins your coverage into the regions:

% extract the regions extents
bins = regions(:,2:3)';
bins = bins(:);

% extract the coverage - only the start is needed
covs = coverage(:,2);

% use histc to place the coverage start into proper regions
% this line counts how many coverages there are in a region
% and assigns them proper region ids.
[h, i]= histc(covs(:), bins(:));

% sum the scores into correct regions (second output of histc gives this)
total = accumarray(i, coverage(:,4), [numel(bins),1]);

% average the score in regions (first output of histc is useful)
avg = total./h;

% remove every second entry - our regions are defined by start/end
avg = avg(1:2:end);

Now, this works under the assumption that the regions do not overlap, but I suppose that is the case. Also, every entry in the coverage file must belong to some region.

Also, if you want to avoid reading the whole file at once, it is trivial to "block" this approach over the coverage. All you need up front is bins from your regions file, which is presumably small. You can then process the coverage in chunks, adding to total incrementally and computing the averages at the end. A sketch of that variant follows.
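
A rough sketch of that block-wise variant, reusing bins from the code above. The chunk size, the filtering of start positions that fall outside every bin, and the max(...,1) guard that gives empty regions 0 instead of NaN are additions not present in the original code; like the code above, it keys purely on position, so the bin boundaries must identify regions unambiguously (for example after splitting the coverage per chromosome):

% process coverage.txt in blocks so the whole file never has to fit in memory
chunk = 1e6;                               % lines per block (tunable)
total = zeros(numel(bins), 1);             % running score sums per bin
hits  = zeros(numel(bins), 1);             % running line counts per bin

fid = fopen('coverage.txt', 'r');
while ~feof(fid)
    C = textscan(fid, '%f %f %f %f', chunk);   % [chrom start end score]
    if isempty(C{1}), break; end
    [h, i] = histc(C{2}, bins);            % bin each coverage start position
    keep = i > 0;                          % ignore starts outside all bins
    total = total + accumarray(i(keep), C{4}(keep), [numel(bins), 1]);
    hits  = hits + h(:);
end
fclose(fid);

avg = total ./ max(hits, 1);               % empty regions give 0 instead of NaN
avg = avg(1:2:end);                        % keep only the per-region bins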

Answered 2012-10-13T12:07:17.483