2

I want to create a hash key in perl hash key that looks like this (lowerR-10,UpperR-12) => 1. Here the key is (lowerR-10,UpperR-12) and its value is 1.

Actually I have a file like this. I have to find the overlap among the the elements.

A 10 12

A 10 15

Whose output will be

A 10 12 2

A 12 15 1

The last column shows the overlap among the elements. I would like to save the count in a hash for which I think the key should be like (lowerR-10,UpperR-12) this. If anyone can give some new suggestion regarding how to save the key it will be great too.

Thanks

4

3 回答 3

0

Maybe the program below will get you close to a solution.

#!/usr/bin/perl
use strict;
use warnings;
use Set::IntSpan;
use Sort::Naturally;

my %data;

while (<DATA>) {
    my ($chr, @start_stop) = split; 
    $data{$chr}{$_}++ for $start_stop[0] .. $start_stop[1];
}

for my $chr (nsort keys %data) {
    my %counts;
    while (my ($range_int, $count) = each %{ $data{$chr} } ) {
        push @{ $counts{$count} }, $range_int;  
    }

    for my $count (sort {$a <=> $b} keys %counts) {
        my $set = Set::IntSpan->new(@{$counts{$count}});
        for my $run ($set->sets) {
            printf "%s %-10s count: %s\n", $chr, $run, $count;
        }
    }
    print "\n";
}

__DATA__
chr1    100    500
chr1    25      50
chr1    10       50
chr1    60       80
chr1    12       40
chr1    41       45
chr1     20      45
chr1     48      80
chr1    4   60
chr2    2   40
chr3    4   90
chr1    5   40
chr2    1   30
chr1    6   20
chr4    9   100
chr1    2   20
chr2    2   90
chr1    6   20
chr4    4   30
chr2    4   90
chr3    3   90
chr2    4   90
chr4    3   90
chr2    4   30

It produced this output.

chr1 2-3        count: 1
chr1 100-500    count: 1
chr1 4          count: 2
chr1 51-59      count: 2
chr1 61-80      count: 2
chr1 5          count: 3
chr1 46-47      count: 3
chr1 60         count: 3
chr1 48-50      count: 4
chr1 6-9        count: 5
chr1 21-24      count: 5
chr1 41-45      count: 5
chr1 10-11      count: 6
chr1 25-40      count: 6
chr1 12-19      count: 7
chr1 20         count: 8

chr2 1          count: 1
chr2 2-3        count: 3
chr2 41-90      count: 3
chr2 31-40      count: 4
chr2 4-30       count: 6

chr3 3          count: 1
chr3 4-90       count: 2

chr4 3          count: 1
chr4 91-100     count: 1
chr4 4-8        count: 2
chr4 31-90      count: 2
chr4 9-30       count: 3

Update: I'll try to explain. The %counts hash is created anew for each chromosome from the outer loop. The keys are the counts of each numbered position, say number 42 was seen 5 times. The value for each count is an anonymous array that has all numbers that were seen 5 times.

Set::IntSpan is used to create ranges, (6-8, 21-24, 41-45), from the anonymous array, (which has 6,7,8,21,22,23,24,41,42,43,44,45 as elements in the anon array). The line for my $run ($set->sets), gets each run list for numbers seen 5 times, (6-8, 21-24, 41-45) and then prints them. You can look at the documentation for Set::IntSpan although it doesn't provide many helpful examples, and I've not been able to find any other good examples by net search, sorry. But basically, you feed Set::IntSpan ranges of numbers and it can give you the condensed subsets, (6-8, 21-24, etc), or each individual number in a set depending on the Set::IntSpan method you use to access the data held by a IntSpan object.

Hope this clears up some questions you had. :-)

于 2012-10-21T19:18:46.623 回答
0

Actually I don't think a hash-based approach lends itself well to this problem; it is better to approach it as a list processing problem. I'll describe an approach in Haskell (this could be translated to Perl with only slightly more code).

The general approach is to keep a running tally of how many elements are overlapping at any given time. To do this, we'll build a list of "deltas" - changes in the running total - from the argument list of elements as (start, end) pairs. Then sort these deltas by where they occur, and combine ones that occur at the same time. Finally, keep a running sum of the deltas, which is equal to the number of overlapping elements at the start of each period (each period ends where the next begins).

Complete code:

import Data.List      -- sortBy
import Data.Ord       -- comparing
import Data.Function  -- `on`

withDelta delta x = (x, delta)

episodeToDelta elements = increases ++ decreases
  where increases = map (withDelta 1    . fst) elements
        decreases = map (withDelta (-1) . snd) elements

compactDelta::[(Integer,Integer)]->(Integer,Integer)
compactDelta (x:xs) = ((fst x),(sum $ map snd (x:xs)))

adjustTotal (t, delta) ((t1, total1):rest) = (t, total1 + delta):(t1,total1):rest

deltasToRunningTotal = reverse . foldr adjustTotal [(0,0)] . reverse

noZeros = filter (\(_, d) -> d /= 0)

elementsToRunningTotal = deltasToRunningTotal . noZeros . map compactDelta . groupBy ((==) `on` fst) . sortBy (comparing fst) . elementToDelta

sample_elements = [(10,15),(5,9),(13,22),(15,19),(14,16),(3,8),(2,12),(20,22),(23,29)]

> sortBy (comparing fst) sample_elements
[(2,12),(3,8),(5,9),(10,15),(13,22),(14,16),(15,19),(20,22),(23,29)]

> episodesToRunningTotal sample_elements
[(0,0),(2,1),(3,2),(5,3),(8,2),(9,1),(10,2),(12,1),(13,2),(14,3),(16,2),(19,1),(20,2),(22,0),(23,1),(29,0)]
于 2012-10-22T03:43:35.203 回答
0

I thought that showing what the hashes, (%data, %counts for each chromosome), might show better what each hash looks like. I use Data::Dumper often when I'm not sure how a hash is structured. It is a very useful module. If you run the program (below) it should better show the hash contents.

You should probably print the output in a file as there is too much the view on (my) terminal - like: perl your_name_program.pl > junk.txt.

Then open junk.txt to see the results.

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use Sort::Naturally qw/ nsort /;

my %data;

while (<DATA>) {
    my ($chr, @start_stop) = split; 
    $data{$chr}{$_}++ for $start_stop[0] .. $start_stop[1];
}

print "Printing \%data hash as follows\n";
print Dumper \%data;
print "\n";

for my $chr (nsort keys %data) {
    my %counts;
    while (my ($range_int, $count) = each %{ $data{$chr} } ) {
        push @{ $counts{$count} }, $range_int;  
    }
    print "Printing \%counts hash for chromosome $chr\n";
    print Dumper \%counts;
    print "\n";
}

__DATA__
chr1    100    500
chr1    25      50
chr1    10       50
chr1    60       80
chr1    12       40
chr1    41       45
chr1     20      45
chr1     48      80
chr1    4   60
chr2    2   40
chr3    4   90
chr1    5   40
chr2    1   30
chr1    6   20
chr4    9   100
chr1    2   20
chr2    2   90
chr1    6   20
chr4    4   30
chr2    4   90
chr3    3   90
chr2    4   90
chr4    3   90
chr2    4   30

(I'm not too clear on how to translate the Haskell solution to Perl at this point, but will give a closer examination later.)

Hope this gives a clearer picture.

于 2012-10-22T16:08:05.823 回答