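What is the best algorithm or approach for iterating over a file with many lines of countries and printing the country that occurs most often?
Each line is a string, and each line contains exactly one country name.
Assume there could be 1 billion distinct countries. (Countries are a bad example.)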
United States
Iran
India
United States
China
Iran
....
....
Canada //1 billionth line
# Count the unique elements.
my %hash;
while (<>) {
    chomp;
    $hash{$_}++;
}

# Find the key with the largest value.
sub largest_value {
    my $hash = shift;
    keys %$hash;    # reset the each() iterator so no entry is skipped
    my ($big_key, $big_val) = each %$hash;
    while (my ($key, $val) = each %$hash) {
        if ($val > $big_val) {
            $big_key = $key;
            $big_val = $val;
        }
    }
    return $big_key;
}

print largest_value(\%hash), "\n";
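# Track the current maximum while reading, keeping every key tied for it.
my $big_count = 0;
my @big_keys;
my %counts;
while (<>) {
    chomp;
    my $count = ++$counts{$_};
    if ($count == $big_count) {
        # Another key has reached the current maximum.
        push @big_keys, $_;
    }
    elsif ($count > $big_count) {
        # This key is the new sole leader.
        $big_count = $count;
        @big_keys = $_;
    }
}
print(join(', ', @big_keys), "\n");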
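You can simply use a hash of integers. The file has many lines, but the number of distinct country names is limited, so memory use depends on the number of distinct names rather than on the file size: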
use strict;
use warnings;

my %hash;
while (<>) {
    chomp;
    $hash{$_}++;
}

# Sort the keys by count, highest first.
my @sorted = sort { $hash{$b} <=> $hash{$a} } keys %hash;
print "$sorted[0]: $hash{$sorted[0]}\n";
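If only the single most frequent key is needed, sorting every key is more work than necessary; one linear pass over the counts is enough. The following is a minimal sketch of that idea using `reduce` from the core `List::Util` module. The helper name `most_frequent` and the sample counts are made up for illustration, and ties are resolved arbitrarily.

```perl
use strict;
use warnings;
use List::Util qw(reduce);

# Hypothetical helper: return the key with the highest count, or undef
# for an empty hash. When several keys tie, which one wins is arbitrary.
sub most_frequent {
    my ($counts) = @_;
    return reduce { $counts->{$a} >= $counts->{$b} ? $a : $b } keys %$counts;
}

# Illustrative counts, not real data.
my %hash = ('United States' => 2, 'Iran' => 2, 'India' => 1, 'China' => 3);
my $top = most_frequent(\%hash);
print "$top: $hash{$top}\n";
```

This keeps the counting loop unchanged and only replaces the sort step, which matters when the number of distinct keys is very large.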
Create a hash table: the key is the country name and the value is the number of occurrences.
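To make the shape of that table concrete, here is a small sketch that builds it from the first six sample lines of the question. The helper name `count_lines` is hypothetical, and an in-memory filehandle stands in for the real input file.

```perl
use strict;
use warnings;

# Hypothetical helper: read lines from a filehandle and return a hash
# reference mapping each distinct line to its number of occurrences.
sub count_lines {
    my ($fh) = @_;
    my %counts;
    while (my $line = <$fh>) {
        chomp $line;
        $counts{$line}++;
    }
    return \%counts;
}

# In-memory filehandle standing in for the real file (assumption for the demo).
my $sample = join "\n",
    'United States', 'Iran', 'India', 'United States', 'China', 'Iran';
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $counts = count_lines($fh);
close $fh;

# Print the table, one "name count" pair per line, ordered by name.
printf "%-15s %d\n", $_, $counts->{$_} for sort keys %$counts;
```

For this sample the table has four entries, with United States and Iran both at 2, so a complete solution also has to decide how ties are reported.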