My program takes this input:

miRNA127    dvex589433  131 154 -   24  87.5    atcgtaacgtatctcccacactta    32  55  98
miRNA32 dvex320240  61  83  -   23  86.9565217391304    cttctacaatggtactgtccatt 31  53  97
miRNA32 dvex623745  141 163 -   23  86.9565217391304    ggtttcttccacaatagtaattt 26  48  97
miRNA79 dvex468096  702 733 -   32  81.25   ttggttaaaaatttttttttttaattaaaaaa    6   37  55
miRNA79 dvex468096  717 743 +   27  81.4814814814815    aaaaaatttttaaccaaagaaaaaaat 13  39  55
miRNA79 dvex468096  694 718 -   25  84  tttttttaattaaaaaacaattttt   17  41  55
miRNA79 dvex468096  696 724 +   29  75.8620689655172    aaattgttttttaattaaaaaaaaaaatt   13  41  55
miRNA79 dvex219016  1103    1130    +   28  78.5714285714286    aaatttttgctaaaaaatacaaaaattt    14  41  55
miRNA79 dvex219016  3420    3446    +   27  77.7777777777778    aaaatattattaaataaataatgcaat 13  39  55
miRNA79 dvex219016  1384    1408    +   25  80  tttcgtgaaacaaaaaagtttggaa   21  45  55
miRNA79 dvex219016  4384    4424    +   25  80  tttcgtgaaacaaaaaagtttggaa   21  45  55
miRNA154    dvex573491  297 324 +   28  78.5714285714286    cagcttgattttaagcctatctgaaagc    23  50  76
miRNA154    dvex546562  232 259 +   28  78.5714285714286    cagcttgattttaagcctatttgaaagc    23  50  76
miRNA154    dvex648254  147 172 +   26  80.7692307692308    aagcctacggagtgcgaggcagagct  47  72  76
miRNA154    dvex648254  277 303 +   26  80.7692307692308    aagcctacggagtgcgaggcagagct  47  72  76

I need to group the rows that share the same $1, $2 and $5 values. So I decided to use a hash of nested arrays:

$VAR1 = {
    'miRNA79 dvex219016 +' => [
        [ '1103', '1130', '14', '41', '55' ],
        [ '3420', '3446', '13', '39', '55' ],
        [ '1384', '1408', '21', '45', '55' ],
        [ '4384', '4424', '21', '45', '55' ]
    ],
    'miRNA79 dvex468096 +' => [
        [ '717', '743', '13', '39', '55' ],
        [ '696', '724', '13', '41', '55' ]
    ],
    'miRNA154 dvex546562 +' => [ [ '232', '259', '23', '50', '76' ] ],
    'miRNA79 dvex468096 -' => [
        [ '702', '733', '6',  '37', '55' ],
        [ '694', '718', '17', '41', '55' ]
    ],
    'miRNA154 dvex648254 +' => [
        [ '147', '172', '47', '72', '76' ],
        [ '277', '303', '47', '72', '76' ]
    ],
    'miRNA127 dvex589433 -' => [ [ '131', '154', '32', '55', '98' ] ],
    'miRNA154 dvex573491 +' => [ [ '297', '324', '23', '50', '76' ] ],
    'miRNA32 dvex320240 -'  => [ [ '61',  '83',  '31', '53', '97' ] ],
    'miRNA32 dvex623745 -'  => [ [ '141', '163', '26', '48', '97' ] ]
};
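
A minimal sketch of how a structure like this can be built, using two of the input rows as inline sample data. The helper name `group_rows` is only illustrative; the key is made from columns 1, 2 and 5 (indices 0, 1, 4), and the stored values are columns 3, 4, 9, 10 and 11.

```perl
use strict;
use warnings;

# Hypothetical helper: group whitespace-separated rows by columns 1, 2 and 5
# and keep start, end and the last three columns of each row.
sub group_rows {
    my @lines = @_;
    my %groups;
    for my $line (@lines) {
        my @f = split ' ', $line;
        next unless @f >= 11;    # skip blank or truncated lines
        push @{ $groups{"@f[0,1,4]"} }, [ @f[ 2, 3, 8, 9, 10 ] ];
    }
    return \%groups;
}

my $groups = group_rows(
    "miRNA79 dvex468096 717 743 + 27 81.4814814814815 aaaaaatttttaaccaaagaaaaaaat 13 39 55",
    "miRNA79 dvex468096 696 724 + 29 75.8620689655172 aaattgttttttaattaaaaaaaaaaatt 13 41 55",
);

# $groups->{'miRNA79 dvex468096 +'} now holds two array refs, in input order:
# [ 717, 743, 13, 39, 55 ] and [ 696, 724, 13, 41, 55 ].
```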

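After that, I sort the nested arrays of each hash key by their first value ([0]). If a key has only one nested array, I print it. But if it has more than one, I need to merge them. Here is a merging example: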

'miRNA79 dvex468096 +' => [
    [ '717', '743', '13', '39', '55' ],
    [ '696', '724', '13', '41', '55' ]
],

After sorting:

$VAR1 = [ [ 696, '724', '13', '41', '55' ],
          [ 717, '743', '13', '39', '55' ] ];

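If the difference between [1][1] and [0][0] is less than or equal to [0][4] (here 743 - 696 = 47, and 47 <= 55), I need to merge the two and produce this new array: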

$VAR1 = [ [ 696, '743', '13', '39', '55' ], ];

and print it. In this other case:

$VAR1 = [
    [ 1103, '1130', '14', '41', '55' ],
    [ 1384, '1408', '21', '45', '55' ],
    [ 3420, '3446', '13', '39', '55' ],
    [ 4384, '4424', '21', '45', '55' ]
];

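evaluating whether the difference between [1][1] and [0][0] is less than or equal to [0][4] gives FALSE, so I need to extract the first nested array and print it, then iterate again and evaluate the same condition on what remains. Whenever the evaluation gives TRUE, I need to merge; whenever it gives FALSE, I need to extract the first nested array and print it. Here is my code: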

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;
use List::Util qw/ min max /;
use List::Util qw(sum);
use Math::MatrixReal;

my %data;
my $val;
my $num;
my $start;
my $end;
my $diff;
my $start_q;
my $end_q;
my @new_data;
my @extract;
my @extract2;
my $limit;

while (<>) {
    chomp;
    my @fields = split;
    push @{ $data{"@fields[0,1,4]"} }, [ @fields[ 2, 3, 8, 9, 10 ] ];
}

foreach my $key ( sort keys %data ) {
    $val = $data{$key};
    $num = scalar @$val;
    next if $num == 0;

    if ( $num == 1 ) {    # print if the hash have 1 nested array
        print
            "$key\t $data{$key}[0][0]\t $data{$key}[0][1]\t $data{$key}[0][2]\t $data{$key}[0][3]\t $data{$key}[0][4]\n";
    }
    else {
        foreach my $keys ( @$val[0] ) {
            my @sorted = sort { $a->[0] <=> $b->[0] }
                @$val;    #organize the nested array values
            $start   = $sorted[0][0];
            $end     = $sorted[1][1];
            $limit   = $sorted[0][4];
            $diff    = $end - $start;
            $start_q = $sorted[0][2];
            $end_q   = $sorted[1][3];

            if ( $diff < $limit ) {
                @new_data = ();
                push( @new_data, $start );
                push( @new_data, $end );
                push( @new_data, $start_q );
                push( @new_data, $end_q );
                push( @new_data, $limit );
                @extract = splice( @{ $sorted[0] }, 0, 5, @new_data );
                @extract2 = splice( @{ $sorted[1] } );
            }
            else {
                my @toprint = splice( @{ $sorted[0] } );
                print
                    "$key\t$toprint[0]\t$toprint[1]\t$toprint[2]\t$toprint[3]\t$toprint[4]\n";
            }
        }
    }
}

Overall, I get this result:

miRNA127 dvex589433 -    131     154     32  55  98
miRNA154 dvex546562 +    232     259     23  50  76
miRNA154 dvex573491 +    297     324     23  50  76
miRNA154 dvex648254 +   147 172 47  72  76 
miRNA32 dvex320240 -     61  83  31  53  97
miRNA32 dvex623745 -     141     163     26  48  97
miRNA79 dvex219016 +    1103    1130    14  41  55

But some values are missing from this list, because my code does not keep iterating when the condition is TRUE. Any suggestions?

1 Answer

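I'm not sure, but I think you are trying to merge some RNA sequences(?) into one when they are close enough (that is, when the resulting length is below some limit). You are probably looking for code like this: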

#!/usr/bin/perl

use strict;
use warnings;

# Input data format positions
use constant KEY_FIELDS => ( 0, 1, 4 );
use constant DATA_FIELDS => ( 2, 3, 8, 9, 10 );

# Entry positions (DATA_FIELDS meanings)
use constant {
    START_P => 0,
    END_P   => 1,
    START_Q => 2,
    END_Q   => 3,
    LIMIT   => 4
};

# Output formatter
use constant TO_PRINT => START_P .. LIMIT;

sub format_entry {
    my ( $key, $data ) = @_;
    join "\t", $key, @$data[TO_PRINT];
}

# Read Data
my %data;
while (<>) {
    chomp;
    my @fields = split;
    push @{ $data{"@fields[KEY_FIELDS]"} }, [ @fields[DATA_FIELDS] ];
}

# Transform data to keep only records supposed to appear in output
for my $value ( values %data ) {
    my @entries = sort { $a->[START_P] <=> $b->[START_P] } @$value;
    my @result = ( shift @entries );    # add first one as reference
    while (@entries) {
        my $ref   = $result[-1];        # reference entry
        my $entry = shift @entries;
        if ( $entry->[END_P] - $ref->[START_P] < $ref->[LIMIT] ) {

            # merge entry into reference
            @$ref[ END_P, END_Q ] = @$entry[ END_P, END_Q ];
        }
        else {
            push @result, $entry;
        }
    }
    $value = \@result;                  # rewrite value in %data hash
}

# Write output
for my $key ( sort keys %data ) {
    print format_entry( $key, $_ ), "\n" for @{ $data{$key} };
}

The result for the data in your question is:

miRNA127 dvex589433 -   131     154     32      55      98
miRNA154 dvex546562 +   232     259     23      50      76
miRNA154 dvex573491 +   297     324     23      50      76
miRNA154 dvex648254 +   147     172     47      72      76
miRNA154 dvex648254 +   277     303     47      72      76
miRNA32 dvex320240 -    61      83      31      53      97
miRNA32 dvex623745 -    141     163     26      48      97
miRNA79 dvex219016 +    1103    1130    14      41      55
miRNA79 dvex219016 +    1384    1408    21      45      55
miRNA79 dvex219016 +    3420    3446    13      39      55
miRNA79 dvex219016 +    4384    4424    21      45      55
miRNA79 dvex468096 +    696     743     13      39      55
miRNA79 dvex468096 -    694     733     17      37      55
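
To make the merge step easier to check on its own, here is a minimal sketch of the same loop applied to a single group. The helper name `merge_entries` and the plain numeric indices are only illustrative; each entry is assumed to be `[start, end, start_q, end_q, limit]`, as in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: merge the entries of one group.
# Each entry is [ start, end, start_q, end_q, limit ].
sub merge_entries {
    # Copy each entry so the caller's data is not modified, then sort by start.
    my @entries = sort { $a->[0] <=> $b->[0] } map { [@$_] } @_;
    return () unless @entries;

    my @result = ( shift @entries );    # first entry is the initial reference
    for my $entry (@entries) {
        my $ref = $result[-1];          # last kept (possibly merged) entry
        if ( $entry->[1] - $ref->[0] < $ref->[4] ) {
            # Close enough: extend the reference's end and end_q.
            @$ref[ 1, 3 ] = @$entry[ 1, 3 ];
        }
        else {
            # Too far: this entry becomes the new reference.
            push @result, $entry;
        }
    }
    return @result;
}

my @merged = merge_entries(
    [ 717, 743, 13, 39, 55 ],
    [ 696, 724, 13, 41, 55 ],
);
print join( "\t", @$_ ), "\n" for @merged;    # one line: 696 743 13 39 55
```

Two details may need adjusting. The comparison is `<`, as in your own code, while your description says "less than or equal", so use `<=` if a difference equal to the limit should also merge. Also, a merged entry keeps its original start, so a chain of nearby entries stops merging once the total span reaches the limit.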
answered 2015-03-16T20:49:11.527