perl - Find common keys across multiple files, store their differing values in arrays, and compute the difference

Question

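I'm very new to Perl, and I'd like to accomplish the following task with it:

I have many files that look like this (whitespace-separated, each with 6 columns and thousands of rows; all file names match *.hgt):

example.hgt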

ID     NAMES           Test1       Test2       Percentage       Height
1      abc100123        A            B          0.21            165
1      abc400123        A            B          0.99            162
1      abc300123        C            B          0.107           165
1      abc200123        A            E          0.31            167
1      abc500123        A            B          0.7             165
....

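Each NAMES value is unique within each .hgt file. I want to find the names that are common to all .hgt files, extract all of their percentages, and compute the maximum difference, i.e. the highest value minus the lowest.

For example, if I have 5 .hgt files that all contain NAMES = abc300123, with corresponding percentages 0.107, 0.1, 0.4, 0.9, 0.8, then the maximum difference for abc300123 should be 0.9 - 0.1 = 0.8.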
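
The arithmetic in this example can be sketched in a few lines, assuming the percentages for one name have already been collected into a list. The sub name `max_difference` is made up for illustration; `min` and `max` come from the core List::Util module.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper: spread between the largest and smallest value.
sub max_difference {
    my @values = @_;
    return max(@values) - min(@values);
}

# Percentages for abc300123 from the five example files
printf "%.1f\n", max_difference(0.107, 0.1, 0.4, 0.9, 0.8);   # prints 0.8
```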

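Then I want to output each name together with the maximum difference computed for it across all my files. The output should be sorted by maximum difference, and each line should be prefixed with an integer (0, 1, 2, 3, ...). An example:

Output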

0. abc500123 0.1
1. abc900123 0.3
2. abc100123 0.7
3. abc300123 0.8
4. abc110123 0.9
....

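I tried to read through each file and store key = NAMES and value = Percentage in an array. I wanted to sort each array of percentages, store the maximum and minimum in a new array, and do the subtraction. At some point I got stuck and couldn't put things together.

Here is what I've written so far: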

open(PIPEFROM, "ls *.hgt |") or die "no \.hgt files founded\!\n";  ## find the files that are ended with hgt
$i=0;
@filenames = "";

while($temp = <PIPEFROM>){

    $temp =~ m/\.hgt/;
    print out "$temp";
    $pre = $`; #gives file name without the dot and the hgt extension
    $filenames[$i] = $pre;
    $i++;
} 


%hash = ();
$j=0;
## read in files ended with .hgt
for ($i = 0; $i<=$filenames; $i++) {
$temp = $filenames[$i];

open(PIPETO, "cat $temp.hgt |") or die "no \.hgt files founded\!\n";

<PIPETO>;
while ($temp2 = <PIPETO> ){
    chomp $temp2;
    $temp2 = ~ s/^\s+//;
    @lst = split(/\s+/, $temp2);
    $NAMES = $lst[1];
    $Percentage = $lst[4];
    $hash{$NAMES} .= $Percentage . " ";
}
}
### manipulate the values
foreach $key (sort keys %hash){

    @values = split(/\s+/, $hash{$key});
    if ($#values == $#filenames){
    print "$j" . "\." . " " . "$key" . "\n";
    $j++;
                         ### got stuck
}
}

I was thinking of including this, but I don't know where to put it:

my ($smallest, $largest) = (sort {$a <=> $b} @array)[0,-1];

This is so frustrating. Any help would be greatly appreciated!

score 2 · Accepted Answer

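Building on Joseph Myers's answer, I made some changes to address your questions about how to keep only the data that appears in every file, how to skip the header line (line 1 of each input file), and how to sort the output by percentage from largest to smallest, falling back to sorting by name when the percentages are equal. Run the program from the command line like this: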

perl output.pl *.hgt

my $file_count = @ARGV or die "invoke program as:\nperl $0 *.hgt\n";

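This puts all of the *.hgt file names into the @ARGV array (instead of piping the data in through cat, as his program does). $file_count then records the number of files to be read. The while loop reads the files listed in @ARGV, much like piping them through cat.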
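
To make the `<>` behaviour concrete, here is a minimal sketch that reads several files through `@ARGV` and skips each file's header line. The helper name `read_without_headers` and the temporary demo files are assumptions for illustration only; the `close ARGV if eof` idiom is the same one the full script relies on.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: read every named file via <>, skip each file's
# first line, and return the remaining lines without newlines.
sub read_without_headers {
    my @files = @_;
    local @ARGV = @files;      # <> reads the files listed in @ARGV
    my @rows;
    while (my $line = <>) {
        if ($. > 1) {          # $. is the line number within the current file
            chomp $line;
            push @rows, $line;
        }
        close ARGV if eof;     # resets $. so the next file's header is line 1 again
    }
    return @rows;
}

# Demo with two small throwaway files (made-up contents)
my $dir = tempdir(CLEANUP => 1);
my %content = (
    'a.hgt' => "ID NAMES\n1 abc100123\n",
    'b.hgt' => "ID NAMES\n1 abc400123\n",
);
for my $file (sort keys %content) {
    open my $fh, '>', "$dir/$file" or die "cannot write $dir/$file: $!";
    print {$fh} $content{$file};
    close $fh;
}
print "$_\n" for read_without_headers(map { "$dir/$_" } sort keys %content);
```

Without the `close ARGV if eof` line, `$.` would keep counting across files and only the very first header would be skipped.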

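In the first for loop, the program checks whether the name was read in from every file (if ($names{$name}{count} == $file_count)). If so, it computes the difference between the percentages; if not, it deletes the name from the %names hash.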

The last for loop prints the results using a custom sort, by_percent_name.
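
As an isolated illustration of that sort order (largest difference first, ties broken alphabetically by name), here is a small sketch. The helper name `names_by_diff_desc` and the sample values are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: names ordered by difference (largest first),
# with ties broken alphabetically by name.
sub names_by_diff_desc {
    my ($diff) = @_;    # hash ref: name => max difference
    return sort { $diff->{$b} <=> $diff->{$a} || $a cmp $b } keys %$diff;
}

my %diff = (abc300123 => 0.8, abc100123 => 0.7, abc110123 => 0.8);
print join(' ', names_by_diff_desc(\%diff)), "\n";
# prints: abc110123 abc300123 abc100123
```

Swapping `$a` and `$b` in the numeric comparison would give the smallest-first order shown in the question's sample output.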

#!/usr/bin/perl
use strict;
use warnings;

my $file_count = @ARGV or die "invoke program as:\nperl $0 *.hgt\n";

my %names;
while (<>) {
    next if $. == 1; # throw header out
    my ($name, $perc) = (split ' ')[1,4];
    $names{$name}{count}++;
    my $t = $names{$name}{minmax} ||= [1,0];
    $t->[0] = $perc if $perc < $t->[0];
    $t->[1] = $perc if $perc > $t->[1];
    close ARGV if eof; # reset line counter, '$.',  to 1 for next file
}

for my $name (keys %names) {
    if ($names{$name}{count} == $file_count) {
        $names{$name} = $names{$name}{minmax}[1] - $names{$name}{minmax}[0];
    }
    else {
        delete $names{$name};   
    }
}

my $i;
my $total = keys %names;
for my $name (sort by_percent_name keys %names) {
    printf "%*d. %s %.6f\n", length($total), ++$i, $name, $names{$name};
}

sub by_percent_name {
    $names{$b} <=> $names{$a}   || $a cmp $b
}
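
Note that the `[1,0]` seed in the script above only works if every percentage lies between 0 and 1. If that is not guaranteed, one possible variation is to seed the pair with the first value seen, as in this minimal sketch. The helper name `track_minmax` and the sample values are assumptions, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: update the [min, max] pair stored under $name.
sub track_minmax {
    my ($minmax, $name, $value) = @_;
    my $t = $minmax->{$name} ||= [$value, $value];   # seed with the first value
    $t->[0] = $value if $value < $t->[0];
    $t->[1] = $value if $value > $t->[1];
    return $t;
}

my %minmax;
track_minmax(\%minmax, 'abc300123', $_) for 0.107, 1.5, 0.4;
printf "%s %.3f\n", 'abc300123', $minmax{abc300123}[1] - $minmax{abc300123}[0];
# prints: abc300123 1.393
```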

score 1 · Accepted Answer

This program does exactly what you specified:

# output.pl
# save this entire script as output.pl
# obtain output by running this command:
#
#   cat *.hgt | perl output.pl | more
# (in order to scroll the results--press "q" in order to quit)
#
#   cat *.hgt | perl output.pl > results-largest-differences-output-$$.txt
# in order to create a temporary results file
#
# BE CAREFUL because the second command overwrites whatever is in
# the output file using the ">" operator!
my %names;
my $maxcount = `ls *.hgt | wc -l`;
my %counts;
while (<>) {
my @fields = (m/(\S+)/g);
my $name = $fields[1];
my $perc = $fields[4];
next unless defined $perc;   # skip blank or short lines
next if $perc =~ m/[^.\d]/;  # skip header lines and other non-numeric values
my $t = ($names{$name} ||= [1, 0]);
# initialize min to as high as possible and max to as low as possible
$t->[0] = $perc if $perc < $t->[0];
$t->[1] = $perc if $perc > $t->[1];
$counts{$name}++; # n.b. undef is auto-initialized to 0 before ++
}

for (keys %names) {
$names{$_} = $names{$_}->[1] - $names{$_}->[0];
}

my $n = 0;
for (sort { $names{$a} <=> $names{$b} || $a cmp $b } keys %names) {
next unless $counts{$_} == $maxcount;
$n++;
printf("%6s %20s %.2f\n", $n, $_, $names{$_});
}
