perl - Iterate over a file with billions of lines and print the line that occurs most often

Question

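What is the best algorithm/approach for iterating over a file with many lines of countries and printing the country that occurs most often?

Each line is a string, and each line contains exactly one country name.

Assume there may be up to 1 billion distinct countries. (Countries are a poor example.)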

United States
Iran
India
United States
China
Iran
....
....
Canada //1 billionth line

score 7 · Accepted Answer

# Count the unique elements.
my %hash;
while(<>) {
    chomp;
    $hash{$_}++;
}

# Find the key with the largest value.
sub largest_value {
    my $hash = shift;

    my ($big_key, $big_val) = each %$hash;

    while (my ($key, $val) = each %$hash) {
        if ($val > $big_val) {
            $big_key = $key;
            $big_val = $val;
        }
    }

    return $big_key;
}

print largest_value(\%hash);

score 2 · Accepted Answer

my $big_count = 0;
my @big_keys;

my %counts;
while (<>) {
    chomp;
    my $count = ++$counts{$_};

    if ($count == $big_count) {
       push @big_keys, $_;
    }
    elsif ($count > $big_count) {
       $big_count = $count;
       @big_keys = $_;
    }
}

print(join(', ', @big_keys), "\n");

score 2 · Accepted Answer

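You can simply use a hash of integer counts. Although there are many lines, the number of distinct country names is limited, so memory use depends on that number rather than on the file size: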

use strict;
use warnings;
my %hash;
while(<>) {
  chomp;
  $hash{$_}++;
}

my @sorted = sort { $hash{$b} <=> $hash{$a} } keys %hash;
print "$sorted[0]: $hash{$sorted[0]}\n";
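
Sorting every key only to read the first element does more work than strictly needed. As a minimal sketch, assuming the counts are already in a hash like the `%hash` built by the loop above, a single pass with `reduce` from the core `List::Util` module finds a key with the largest count. The sample counts below are hypothetical and stand in for real input.

```perl
use strict;
use warnings;
use List::Util qw(reduce);

# Hypothetical counts standing in for the %hash built from the input file.
my %hash = (
    'United States' => 3,
    'Iran'          => 2,
    'India'         => 1,
    'China'         => 1,
);

# Keep whichever of the two keys has the larger count.
# If several keys tie for the maximum, which one is returned depends on hash order.
# reduce returns undef when the hash is empty.
my $top = reduce { $hash{$a} >= $hash{$b} ? $a : $b } keys %hash;

print "$top: $hash{$top}\n" if defined $top;
```

This runs in time linear in the number of distinct keys, whereas the sort is O(n log n) in that number.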

score 0 · Accepted Answer

Create a hash table: the key is the country name, and the value is the number of occurrences.
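
A minimal sketch of that idea, assuming the input is read line by line from a filehandle. The helper name `most_frequent_line` and the in-memory sample data are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count every line read from a filehandle and
# return the most frequent line together with its count.
sub most_frequent_line {
    my ($fh) = @_;
    my %count;                               # key: line text, value: occurrences
    my ($best, $best_count) = (undef, 0);

    while (my $line = <$fh>) {
        chomp $line;
        my $n = ++$count{$line};
        # Strictly greater: on a tie, the line that reached the count first wins.
        ($best, $best_count) = ($line, $n) if $n > $best_count;
    }
    return ($best, $best_count);
}

# Demonstration on an in-memory filehandle instead of a real file.
my $sample = "United States\nIran\nIndia\nUnited States\nChina\nIran\n";
open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
my ($name, $times) = most_frequent_line($fh);
close $fh;

print "$name: $times\n";
```

Memory use grows with the number of distinct lines, not with the total number of lines, so this only works when the distinct values fit in memory.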

Answered 2012-12-13T04:29:08.340
