
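I'm trying to standardize my data (subtract the mean from each raw value and divide by the standard deviation) with an automated script. I have a file with 10,000 rows. I first need to compute the mean and standard deviation of each column, and then use those values to get the new standardized values. I can do this easily in Excel, but I'm looking for an automated script.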

Input

DOTR1   10.29006    10.06744    10.47105    10.05041    10.18407    9.770205    10.90548    10.75112
RCC2    6.699481    7.240353    7.263434    6.654058    6.86063 7.151931    6.796337    6.78525
HHPA6   7.31182 7.547056    8.338827    7.278408    7.545548    7.409964    7.149899    7.300342
PAX8    8.336847    8.651292    8.493323    8.5056  8.445139    8.651406    8.664237    8.56571
ACA1A   4.233111    4.320666    4.232803    4.390224    4.269969    4.314899    4.264211    4.142419
UBA7    8.196608    8.164725    7.361889    8.055019    8.882745    7.6884  7.835754    8.354209
OOA 5.098222    5.212986    5.301191    5.211401    5.13133 5.153725    5.269111    5.195991
ACX1    4.875679    5.01305 4.921618    4.930978    4.899562    4.92918 4.970339    4.986362    

For column 1, the mean is 6.880 and the stdev is 2.066.

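I then subtract the mean from each observation and divide by the stdev, e.g. (10.29006-6.880)/2.066. I do this row by row for every observation in column 1. For column 2, I again compute its mean and standard deviation and follow the same procedure.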

Thanks,

I tried the following code to get the average and stdev, but I'm stuck on the next step.

sub average {
    my ($data) = @_;
    if (not @$data) {
        die("Empty array\n");
    }
    my $total = 0;
    foreach (@$data) {
        $total += $_;
    }
    my $average = $total / @$data;
    return $average;
}

sub stdev {
    my ($data) = @_;
    if (@$data == 1) {
        return 0;
    }
    my $average = average($data);
    my $sqtotal = 0;
    foreach (@$data) {
        $sqtotal += ($average - $_) ** 2;
    }
    my $std = ($sqtotal / (@$data - 1)) ** 0.5;
    return $std;
}

2 Answers


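Just use an array of arrays to represent the table. Walk the table column by column, compute the mean and standard deviation, then replace each value in the column.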

#!/usr/bin/perl
use warnings;
use strict;

open my $IN, '<', 'input' or die $!;

my @table;

while (<$IN>) {
    $table[$. - 1] = [ split ];
}

for my $column (1 .. $#{ $table[0] }) {

    my $total = 0;
    $total   += $_ for map $table[$_][$column], 0 .. $#table;
    my $mean  = $total / @table;

    my $sqtot = 0;
    $sqtot   += ($mean - $_) ** 2 for map $table[$_][$column], 0 .. $#table;
    my $stdev = ($sqtot / $#table) ** 0.5;

    $table[$_][$column] = ($table[$_][$column] - $mean) / $stdev for 0 .. $#table;
}

$\ = "\n";
for my $line (@table) {
    print join "\t", @$line;
}
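
The same column loop can also be built on the `average` and `stdev` subs from the question, which may be easier to follow if those subs are already in place. The following is only a minimal sketch: `standardize_columns` is a hypothetical helper name, the zero-stdev check is an added assumption about how a constant column should be handled, and the three inlined rows are just the first two value columns of the question's input.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# average() and stdev() follow the subs posted in the question.
sub average {
    my ($data) = @_;
    die "Empty array\n" unless @$data;
    my $total = 0;
    $total += $_ for @$data;
    return $total / @$data;
}

sub stdev {
    my ($data) = @_;
    return 0 if @$data == 1;
    my $average = average($data);
    my $sqtotal = 0;
    $sqtotal += ($average - $_) ** 2 for @$data;
    return ($sqtotal / (@$data - 1)) ** 0.5;   # sample stdev (n - 1)
}

# Hypothetical helper: standardizes every column except column 0
# (the row labels), modifying the array of arrays in place.
sub standardize_columns {
    my ($table) = @_;
    for my $column (1 .. $#{ $table->[0] }) {
        my @values = map { $_->[$column] } @$table;
        my $mean   = average(\@values);
        my $sd     = stdev(\@values);
        # Assumption: a constant column is treated as an error.
        die "Column $column has zero standard deviation\n" if $sd == 0;
        $_->[$column] = ($_->[$column] - $mean) / $sd for @$table;
    }
    return $table;
}

# A few values from the question's input, inlined for the demo.
my @table = (
    [ 'DOTR1', 10.29006, 10.06744 ],
    [ 'RCC2',  6.699481, 7.240353 ],
    [ 'HHPA6', 7.31182,  7.547056 ],
);

standardize_columns(\@table);

for my $row (@table) {
    my ($label, @values) = @$row;
    print join("\t", $label, map { sprintf '%.3f', $_ } @values), "\n";
}
```

After the call, each standardized column should have a mean of about 0 and a sample standard deviation of about 1, which is a quick way to sanity-check the result.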
answered 2013-01-30T21:41:30.293

I thought I'd post a solution I came up with, even though it isn't as simple as choroba's. It uses Statistics::Descriptive.

Update: Hmm, this isn't a great solution: it creates 3 arrays when one would probably do. Ignore this solution.

#!/usr/bin/perl
use strict;
use warnings;
use Statistics::Descriptive;

my @data = map [split], <DATA>;

my @transpose = transpose(@data);
my @stats;

for my $row (@transpose[1.. $#transpose]) {
    my $stat = Statistics::Descriptive::Full->new;
    $stat->add_data($row);
    push @stats, [$stat->mean, $stat->standard_deviation];
}

my @new;

for my $r (0 .. $#data) {
    my @tmp;
    for my $c (1 .. $#{$data[$r]}) {
        push @tmp, ($data[$r][$c] - $stats[$c-1][0]) / $stats[$c-1][1];
    }
    push @new, [$data[$r][0], map {sprintf "%.3f", $_} @tmp];
}

# output loop
for my $row (@new) {
    print join("\t", @$row), "\n";  
}

sub transpose {
    my @array = @_;

    my @trans;
    for my $i (0 .. $#array) {
        for my $j (0 .. $#{$array[$i]}) {
            $trans[$j][$i] = $array[$i][$j];    
        }   
    }
    return @trans;
}

__DATA__
DOTR1   10.29006    10.06744    10.47105    10.05041    10.18407    9.770205    10.90548    10.75112
RCC2    6.699481    7.240353    7.263434    6.654058    6.86063 7.151931    6.796337    6.78525
HHPA6   7.31182 7.547056    8.338827    7.278408    7.545548    7.409964    7.149899    7.300342
PAX8    8.336847    8.651292    8.493323    8.5056  8.445139    8.651406    8.664237    8.56571
ACA1A   4.233111    4.320666    4.232803    4.390224    4.269969    4.314899    4.264211    4.142419
UBA7    8.196608    8.164725    7.361889    8.055019    8.882745    7.6884  7.835754    8.354209
OOA 5.098222    5.212986    5.301191    5.211401    5.13133 5.153725    5.269111    5.195991
ACX1    4.875679    5.01305 4.921618    4.930978    4.899562    4.92918 4.970339    4.986362

This prints:

C:\Old_Data\perlp>perl t33.pl
DOTR1   1.650   1.516   1.624   1.610   1.490   1.502   1.797   1.698
RCC2    -0.087  0.106   0.102   -0.117  -0.079  0.140   -0.085  -0.102
HHPA6   0.209   0.259   0.612   0.200   0.245   0.274   0.077   0.132
PAX8    0.705   0.810   0.686   0.824   0.669   0.920   0.770   0.706
ACA1A   -1.281  -1.349  -1.335  -1.268  -1.301  -1.336  -1.244  -1.302
UBA7    0.637   0.567   0.149   0.595   0.875   0.419   0.391   0.610
OOA     -0.862  -0.904  -0.829  -0.851  -0.895  -0.900  -0.784  -0.824
ACX1    -0.970  -1.004  -1.009  -0.993  -1.004  -1.017  -0.921  -0.919
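
Following up on the update about the extra arrays, here is a rough sketch of how the same Statistics::Descriptive calls could overwrite the values in the original array of rows instead of building a transposed copy and a separate result array. `standardize_in_place` is a hypothetical name, the inlined rows are a small subset of the input above, and the sketch assumes no column is constant (a zero standard deviation would cause a division by zero).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Statistics::Descriptive;

# Hypothetical helper: overwrites columns 1..N of the array of rows;
# column 0 holds the row labels and is left untouched.
sub standardize_in_place {
    my ($rows) = @_;
    for my $c (1 .. $#{ $rows->[0] }) {
        # A fresh statistics object per column.
        my $stat = Statistics::Descriptive::Full->new;
        $stat->add_data(map { $_->[$c] } @$rows);
        my ($mean, $sd) = ($stat->mean, $stat->standard_deviation);
        # Assumes $sd is non-zero, i.e. the column is not constant.
        $_->[$c] = ($_->[$c] - $mean) / $sd for @$rows;
    }
    return $rows;
}

# A few values from the input above, inlined for the demo.
my @data = (
    [ 'DOTR1', 10.29006, 10.06744 ],
    [ 'RCC2',  6.699481, 7.240353 ],
    [ 'HHPA6', 7.31182,  7.547056 ],
    [ 'PAX8',  8.336847, 8.651292 ],
);

standardize_in_place(\@data);

# Format only at print time so the stored values keep full precision.
for my $row (@data) {
    my ($label, @values) = @$row;
    print join("\t", $label, map { sprintf '%.3f', $_ } @values), "\n";
}
```

Only the per-column list passed to `add_data` is temporary here; the table itself is the single long-lived array.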
answered 2013-01-30T23:08:58.547