perl - Averaging the columns of a .csv file containing floating-point numbers using bash or perl

Question

I have a few thousand files containing data like the following:

bash$ cat somefile0001.csv
col1;col2;col3; ..... ;col10
2.34;0.19;6.40; ..... ;4.20
3.8;2.45;2.20; ..... ;5.09E+003

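Basically, each one is a 10x301 field .csv file with a semicolon-separated header line at the top (I have not included the whole thing, for brevity).

My goal is to convert the scientific notation to decimal numbers, average each column, and write the column headers together with the column averages to a new csv file, and then do the same for the thousands of other files.

I already have working code to parse all the files; I just can't get the averaging part to work: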

 #!/bin/bash
 filename=csvfile.csv
 i=1
      runningsum=0
      echo ""> $filename.tmp.$i
      tmptrnfrm=$(cut -f$i -d ';' $filename)
      tmpfilehold=$filename.tmp.$i
      echo "$tmptrnfrm" >> $tmpfilehold
      trnsfrmcount=0

      for j in $(cat $tmpfilehold)
      do
           if [[ $trnsfrmcount = 0 ]]]
           then
                echo -n "Iteration $trnsfrmcount:"
                echo "$j" #>> $tmpfilehold
                trnsfrmcount=$[$trnsfrmcount+1]
           elif [[ $trnsfrmcount < 301 ]]
           then
                if [[ $(echo $j | sed 's/[0-9].[0-9][0-9]E+[0-9]/arbitrarystring/' ) == arbitrarystring ]]
                then
                     tempj=$(printf "%0f" $j)
                     runningsum=$(echo '$runningsum + $tempj' | bc)
                     echo "$j" #>> tmpfilehold
                     trnsfrmcount=$[$trnsfrmcount+1]
                else
                     echo "preruns: $runningsum"
                     runningsum=$(echo '$runningsum + $j' | bc)
                     echo "$j," #>> $tmpfilehold
                     echo "the running sum is: $runningsum"
                     trnsfrmcount=$[$trnsfrmcount+1]
                fi
           fi
      done
 totalz=$(echo '$runningsum / 300' | bc)
 echo "here is the total"
 echo "$totalz"

 exit 0

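I know it's a bit messy; I print a lot of extra strings to stdout to see what happens while it runs. I would like to do this in perl, but I am only just learning it, and I know this can be done in bash. I also don't have access to the CSV module and can't install it (otherwise this would probably be really easy).

Any help is greatly appreciated.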

score 1 · Accepted Answer

Here is a basic perl script that should do what you need. I have not tested it.

#!/usr/bin/perl 
use strict;
use warnings;

my $infile = shift;
my $outfile = shift || $infile . ".new";

my $header = "";
my $count  = 0;
my @sums   = ();
my @means  = ();

open my $fin, '<', $infile or die $!;

$header = <$fin>;
@sums = map { 0 } split ";", $header;    # to initialize @sums;

while ( my $line = <$fin> ) {
    chomp $line;

    my @fields = split ";", $line;
    for ( my $i = 0 ; $i < scalar @fields ; $i++ ) {

        # Perl converts strings in scientific notation (e.g. 5.09E+003)
        # to numbers automatically when they are used in arithmetic,
        # so no special case is needed. Rounding each value with
        # sprintf("%.2f") before summing would lose precision.
        $sums[$i] += $fields[$i];
    }

    $count++;
}

close $fin;

exit 1 if $count == 0;

# calculate averages
@means = map { sprintf( "%.2f", $_ / $count ) } @sums;

# intentionally left out writing to a file
print $header;
print join( ";", @means ) . "\n";

score 0 · Accepted Answer

Tabulator is a set of Unix command-line tools for working with delimited files that have header lines. Here is an example that computes the average of the first two columns:

tblred -d';' -su -c'avg1_col=avg(col1),avg_col2=avg(col2)' somefile00001.csv

produces

avg1_col;avg_col2
3.07;1.32
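
If installing an extra tool is not an option, the same per-column averages can probably be had with plain awk, which reads values such as 5.09E+003 as ordinary numbers. The following is only a rough sketch, not something taken from either answer: the function name `average_columns` and the `.avg` output suffix are made up for illustration, and it assumes the first line is the header and every other non-blank line is purely numeric.

```shell
#!/bin/bash
# Hypothetical helper (name is an assumption): print the header of one
# semicolon-separated file followed by a line with the mean of each column.
average_columns() {
    awk -F';' '
        NR == 1 { print; next }      # pass the header line through unchanged
        NF == 0 { next }             # ignore blank lines
        {
            # adding a field forces numeric conversion, so 5.09E+003 counts as 5090
            for (i = 1; i <= NF; i++) sum[i] += $i
            if (NF > ncols) ncols = NF
            rows++
        }
        END {
            if (rows == 0) exit 1    # no data rows, nothing to average
            for (i = 1; i <= ncols; i++)
                printf "%.2f%s", sum[i] / rows, (i < ncols ? ";" : "\n")
        }
    ' "$1"
}

# Run it over every matching file; the .avg suffix is an arbitrary choice.
for f in somefile*.csv; do
    [ -e "$f" ] || continue          # the glob matched nothing
    average_columns "$f" > "$f.avg"
done
```

For a three-column file holding just the two sample rows `2.34;0.19;6.40` and `3.8;2.45;5.09E+003`, this should print the header followed by `3.07;1.32;2548.20`. Summing the raw values and rounding only once at the end avoids accumulating rounding error across the 300 rows.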
