perl - Optimizing a Unix search

Question

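I have 20 files of 500k lines each, with 2 numbers per line (call them A and B). The goal is to get, for each A number, the count of distinct (A, B) pairs as a percentage of the total number of lines with that A. So the result should be each A number found in these files together with its percentage.

For example: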

1 1

1 1

1 1

1 2

should give me 1 50% (2 distinct pairs out of 4 lines with A = 1).

The following approach is too slow. Distinct counts:

cat files | sort | uniq -c

Totals:

cat files | cut -f1 | sort | uniq -c

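Then I iterate over these results and compute the percentage for each A number.

How can I best optimize this (bash/perl)? Also, if this only needs to be done for a subset of the A numbers (e.g. 20k A numbers rather than all 500k), how can it be optimized?

Thanks in advance.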

score 1 · Accepted Answer

A Perl solution. Try running it as script.pl files and see how fast it runs.

#!/usr/bin/perl
use warnings;
use strict;

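# $hash{A}{B} = number of times the pair (A, B) was seen
my %hash;
while (<>) {
    my @nums = split;
    next if @nums < 2;    # skip blank or incomplete lines
    $hash{$nums[0]}{$nums[1]}++;
}

#for my $num (sort { $a <=> $b } keys %hash) {
for my $num (keys %hash) {
    my @values = values %{ $hash{$num} };    # one count per distinct pair
    my $sum = 0;
    $sum += $_ for @values;                  # total lines for this A
    my $perc = 100 * @values / $sum;         # distinct pairs / total lines
    print "$num $perc%\n";
}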

If you want the output sorted by the first number, uncomment the sort line (and comment out the line that follows it).
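The second part of the question, restricting the computation to a subset of the A numbers, is not covered by the script above. One possible approach is sketched below. It uses awk from the shell rather than Perl, so it is a swapped-in technique and not the answerer's method. The function name subset_percentages, and the assumption that the wanted A numbers are stored one per line in a separate file passed as the first argument, are made up for illustration.

```shell
#!/bin/sh
# Sketch (hypothetical helper): percentage of distinct (A, B) pairs per A,
# limited to the A values listed in the first file argument.
subset_percentages() {
    awk '
        # First file: remember each wanted A value (one per line).
        NR == FNR { if (NF) wanted[$1] = 1; next }

        # Data files: skip incomplete lines and A values outside the subset.
        NF < 2 || !($1 in wanted) { next }

        {
            total[$1]++                            # every line for this A
            if (!seen[$1, $2]++) distinct[$1]++    # first sighting of this pair
        }

        END {
            # Iteration order of "for (a in total)" is unspecified.
            for (a in total)
                printf "%s %g%%\n", a, 100 * distinct[a] / total[a]
        }
    ' "$@"
}
```

It would be called as subset_percentages wanted.txt file1 file2 (hypothetical file names), and the output can be piped through sort -n if ordering matters. Only the wanted A values are ever stored, so memory use follows the size of the subset rather than the full input. The NR == FNR test assumes the wanted-list file is not empty; with an empty list the first data file would be mistaken for it.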
