
I have a dataset that looks like this:

1. Dataset

NR_046018   DDX11L1 ,   0   0   1   1   1   1   1   1   1      1    0   0   0   0   1.44    2.72    3.84    4.92
NR_047520   LOC643837   ,   3   2.2 0.2 0   0   0.28    1   1   1   1   2.2 4.8 5   5.32    5   5   5   5   3
NM_001005484    OR4F5   ,   2   2   2   1.68    1   0.48    0   0.92    1   1.8 2   2   2   2.04    3.88    3
NR_028327   LOC100133331    ,   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

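2. What is needed

  1. Shuffle the array 10 times. After each shuffle, split the array into 2 new arrays, e.g. set1 and set2 (half goes into set1, the other half into set2).

  2. For each new array, compute the maximum of the numbers on each row, then compute the average of those maxima over all rows.

  3. Collect the 10 average maxima for each of set1 and set2 (10 shuffles give 10 average maxima). Compute the mean of the 10 average maxima obtained for each set; call these 10avg1 and 10avg2.

  4. Get a list of 1000 10avg1 values and 1000 10avg2 values.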

3. Code

use warnings;
use List::Util qw(max shuffle);

my $file = 'mergesmall.txt';

#Open file and output file
open my $fh,'<',$file or die "Unable to open file";
open OUT,">Shuffle.out" or die;

#Read into array
my @arr = <$fh>;

#Initialize loop for shuffling 10 times
my $i=10;
while($i){
    my @arr1 = ();  #Initialize 1st set
    my @arr2 = ();  #Initialize 2nd set

    my @shuffled = shuffle(@arr);

    push @arr1,(@shuffled[0..1]); #Shift into 1st set
    push @arr2,(@shuffled[2..3]); #Shift into 2nd set



    foreach $_(@arr1){
        my @val1 = split;
        my $max1 = max(@val1[3..$#val1]);

         $total1 += $max1;
         $num1++;
    }

    my $average_max1 = $total1 /  $num1;
    #print "\n\n","Average max 1st set is : ",$average_max1;
    print OUT "Average max 1st set is : ",$average_max1;

        foreach $_(@arr2){
        my @val2 = split;
        my $max2 = max(@val2[3..$#val2]);

        print "\n\n";

         $total2 += $max2;
         $num2++;
    }

    my $average_max2 =  $total2 /  $num2;
    #print "\n\n","Average max 2nd set is : ",$average_max2;
    print OUT "\n","Average max 2nd set is : ",$average_max2,"\n\n";


    $i--;

}       

4. Problem

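So far I have been able to write code that gets the 10 average maxima for each of set1 and set2. I cannot figure out how to compute the average of those 10 average maxima. If I can work that out, I can easily run it in a for loop 1000 times and get 1000 10avgset1 and 1000 10avgset2 values.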

5. Notes

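  1. Each row of the actual dataset contains at most 400 numbers; some rows have fewer, some have none at all, but never more than 400.

  2. The actual dataset has 41,382 rows. Set1 will contain 23,558 rows and set2 will contain 17,824 rows.

  3. The file is a .txt file, and all numbers on each row are tab-separated.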

I would appreciate any ideas on how to compute the average of the average maxima. I thought of using push @10avgset1, $average_max1, but I could not get it to work.


1 Answer


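The first thing I noticed: you are not using the strict pragma, so you are effectively working with global variables. I am not sure that is what you want. Also, variable names may not (usually) start with a digit.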
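As a minimal sketch of both points: with strict enabled, every variable has to be declared with my, and a name such as @10avgset1 is not accepted as an array name, whereas @avg10_set1 is. The name @avg10_set1 and the three sample values are assumptions made up for illustration; they do not come from the question.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical name: identifiers must not start with a digit,
# so @10avgset1 is out, but @avg10_set1 is legal.
my @avg10_set1;

# Stand-in values for the per-shuffle averages (invented for this sketch).
for my $average_max1 (2.5, 3.0, 3.5) {
    push @avg10_set1, $average_max1;
}

# Average of the collected averages.
my $avg_of_avgs = sum(@avg10_set1) / @avg10_set1;
printf "average of %d averages: %.2f\n", scalar @avg10_set1, $avg_of_avgs;
```

The push inside the loop is the same idea as in the question; it only needs a declared array with a valid name.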

The second thing I noticed: you repeat yourself a lot.

Here is a function that performs this odd "average of maxima":

use constant CARRY => 1; # true reproduces the original code, which never resets its totals

sub make_accumulator {
    my $group = shift;
    my ($max, $num) = (0, 0);   # running sum of row maxima, and number of rows seen
    my @acc;                    # every average computed so far
    my $acc = sub {
        ($max, $num) = (0, 0) unless CARRY;   # start from zero on each call
        for (@_) {
            $max += max @$_;
            $num++;
        }
        my $avg = $max / $num;
        push @acc, $avg;
        printf "Average max in set %d is %.2f\n", $group, $avg;
        $avg;
    };
    my $get = sub { @acc };
    ($acc, $get);
}

We can then write my ($acc, $get) = make_accumulator(1), where $acc is a callback that encapsulates your algorithm and $get returns the list of all such values computed so far.
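To make the effect of carrying the totals concrete, here is a small sketch with fixed input instead of shuffled data. The helper make_mean_of_maxima and its $carry argument are hypothetical simplifications of the accumulator above (no printing, no history), assumed only for this illustration.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper: returns a closure that averages row maxima.
# With $carry true the totals survive between calls (as in the question's
# code); with $carry false every call starts from zero.
sub make_mean_of_maxima {
    my ($carry) = @_;
    my ($total, $count) = (0, 0);
    return sub {
        ($total, $count) = (0, 0) unless $carry;
        for my $row (@_) {
            $total += max @$row;
            $count++;
        }
        return $total / $count;
    };
}

my @first  = ([1, 4], [2, 2]);   # row maxima 4 and 2
my @second = ([0, 6], [8, 3]);   # row maxima 6 and 8

my $carrying = make_mean_of_maxima(1);
my $fresh    = make_mean_of_maxima(0);

my @carry_results = ($carrying->(@first), $carrying->(@second));
my @fresh_results = ($fresh->(@first),    $fresh->(@second));

print "carry: @carry_results\n";   # 3 and 5: the second value mixes both calls
print "fresh: @fresh_results\n";   # 3 and 7: each call stands alone
```

If each shuffle is meant to yield an independent average, the resetting variant is probably the one you want.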

The actual average is computed with

sub average { sum(@_) / @_ }

To initialise the script, I did

#!/usr/bin/perl

use strict;
use warnings;
use List::Util qw(shuffle max sum);

use constant CARRY => 1;

my @arr = map {my @arr = split; [@arr[3..$#arr]]} <DATA>;

my ($acc1, $get1) = make_accumulator(1);
my ($acc2, $get2) = make_accumulator(2);

The @arr line parses each input line only once, at load time. I then loop a few times over shuffled versions of @arr:

for (1 .. 5){
    my @shuffled = shuffle @arr;

    my $halfway = int (@shuffled / 2);
    my @arr1 = @shuffled[0 .. $halfway];
    my @arr2 = @shuffled[$halfway .. $#shuffled];

    my $average_max1 = $acc1->(@arr1);
    my $average_max2 = $acc2->(@arr2);

    printf "running: %.2f %.2f\n", average($get1->()), average($get2->());
    print "\n";
}

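Here I split the shuffled list at its midpoint; you will want to hard-code 23557 instead later. Note that both slices include index $halfway, so that row lands in both sets; use 0 .. $halfway - 1 for the first slice if the sets must be disjoint. I then print the running averages for set1 and set2.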

This produces output like:

Average max in set 1 is 2.93
Average max in set 2 is 4.60
running: 2.93 4.60

Average max in set 1 is 3.17
Average max in set 2 is 4.60
running: 3.05 4.60

Average max in set 1 is 3.09
Average max in set 2 is 4.60
running: 3.07 4.60

Average max in set 1 is 3.17
Average max in set 2 is 4.55
running: 3.09 4.59

Average max in set 1 is 3.22
Average max in set 2 is 4.03
running: 3.12 4.48

If I set CARRY to a false value, I get

Average max in set 1 is 3.07
Average max in set 2 is 5.12
running: 3.07 5.12

Average max in set 1 is 3.07
Average max in set 2 is 2.46
running: 3.07 3.79

Average max in set 1 is 3.07
Average max in set 2 is 4.40
running: 3.07 3.99

Average max in set 1 is 3.41
Average max in set 2 is 4.40
running: 3.15 4.10

Average max in set 1 is 3.07
Average max in set 2 is 5.12
running: 3.14 4.30

This looks rather silly, because four lines allow only a few possible combinations (n!/(n/2)!, I guess).

Of course, these values differ on every run, because shuffle is pseudo-random.
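Putting the pieces together, the 1000 repetitions asked for in the question could be sketched as follows. Everything here is an assumption layered on the answer: the helper names mean_of_maxima and ten_avg are hypothetical, rows without numbers are counted as a maximum of 0, the two sets are disjoint with an explicit size for set 1, and the rows are abbreviated stand-ins with the same maxima as the four sample rows. The srand call is optional; to my knowledge List::Util's shuffle draws from Perl's built-in generator by default, so a fixed seed should make runs repeatable.

```perl
use strict;
use warnings;
use List::Util qw(shuffle max sum);

sub average { sum(@_) / @_ }

# Mean of the per-row maxima; rows without numbers count as 0 (an assumption).
sub mean_of_maxima {
    my @rows = @_;
    return average(map { @$_ ? max(@$_) : 0 } @rows);
}

# One pair of "10avg" values: shuffle $shuffles times, put the first $size1
# rows into set 1 and the rest into set 2, then average the per-shuffle means.
# Assumes 0 < $size1 < number of rows, so neither set is empty.
sub ten_avg {
    my ($rows, $size1, $shuffles) = @_;
    my (@avg1, @avg2);
    for (1 .. $shuffles) {
        my @shuffled = shuffle @$rows;
        push @avg1, mean_of_maxima(@shuffled[0 .. $size1 - 1]);
        push @avg2, mean_of_maxima(@shuffled[$size1 .. $#shuffled]);
    }
    return (average(@avg1), average(@avg2));
}

srand(42);   # optional fixed seed, so repeated runs can be compared

# Abbreviated stand-ins for the four sample rows (same maxima as the data).
my @rows = ([0, 1, 4.92], [3, 2.2, 5.32], [2, 1.68, 3.88], [0, 0, 0]);

my (@all_avg1, @all_avg2);
for (1 .. 1000) {
    my ($avg1, $avg2) = ten_avg(\@rows, 2, 10);
    push @all_avg1, $avg1;
    push @all_avg2, $avg2;
}
printf "%d values per set, first pair: %.2f %.2f\n",
    scalar @all_avg1, $all_avg1[0], $all_avg2[0];
```

For the real dataset, the call would presumably become ten_avg(\@arr, 23558, 10), using the set sizes given in the question.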

Edit:

The DATA filehandle assumes that you have a data section at the end of your script, e.g.

__DATA__
NR_046018   DDX11L1 ,   0   0   1   1   1   1   1   1   1      1    0   0   0   0   1.44    2.72    3.84    4.92
NR_047520   LOC643837   ,   3   2.2 0.2 0   0   0.28    1   1   1   1   2.2 4.8 5   5.32    5   5   5   5   3
NM_001005484    OR4F5   ,   2   2   2   1.68    1   0.48    0   0.92    1   1.8 2   2   2   2.04    3.88    3
NR_028327   LOC100133331    ,   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

To use whatever files are listed on the command line, do

my @arr = map {...} <>;  # no explicit filehandle

or open the file manually.
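Opening the file manually might look like the sketch below. The helper name read_rows is hypothetical; it assumes the layout shown in the question, where the first three whitespace-separated columns are the ID, the gene name and a "," and everything after that is numeric.

```perl
use strict;
use warnings;

# Hypothetical helper: read a tab-separated file and return one array
# reference of numbers per line, skipping the first three columns.
sub read_rows {
    my ($file) = @_;
    open my $fh, '<', $file or die "Unable to open $file: $!";
    my @rows;
    while (my $line = <$fh>) {
        chomp $line;
        my @fields = split ' ', $line;   # any whitespace, tabs included
        next unless @fields;             # ignore blank lines
        # The slice is empty for rows that carry no numbers.
        push @rows, [ @fields[3 .. $#fields] ];
    }
    close $fh;
    return @rows;
}
```

With the file name from the question, the call would be my @arr = read_rows('mergesmall.txt'). Rows without numbers come back as empty array references, so the code that takes max of each row has to decide how to treat them.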

answered 2012-12-21T10:56:55.340