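I tried to understand what you are doing in your code and improved it to do what you want. Disclaimer: it's not that simple. For example, the algorithm can't see that you don't want to merge 44848.. and 4492... into the group 44....., but that you do want the group 4492... rather than 44924.. and so on. But maybe this already helps you.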
I think the important part is the "smart filter", which for example looks at 336 and 3368 and deletes the count of 336 if it isn't higher than the other (336 marks a trivial superset of 3368). Important here is the string sort together with the state variable $last:
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw(say state);
use List::Util 'shuffle';

# shuffled phone numbers (don't make it too easy)
my @numbers = shuffle (
    4484800 .. 4484899,
    3368700 .. 3368799,
    4492000 .. 4492999
);

my %count = ();

# import phone numbers
foreach my $number (@numbers) {
    # work on all substrings from the beginning
    for (my $pos = 1; $pos <= length $number; $pos++) {
        my $prefix = substr $number, 0, $pos;
        $count{$prefix}++; # increase the number of equal prefixes
    }
}

# smart filter
foreach my $prefix (sort {$a cmp $b} keys %count) {
    state $last //= 'nothing';

    # delete trivial super sets
    if ($prefix =~ /^\Q$last/ and $count{$last} == $count{$prefix}) {
        delete $count{$last};
    }

    # delete trivial sets
    if ($count{$prefix} == 1) {
        delete $count{$prefix};
        next;
    }

    # remember the last prefix
    $last = $prefix;
}

# output
say "$_ ($count{$_})" for sort {
    $count{$b} <=> $count{$a} or $a cmp $b
} keys %count;
The output is correct, but not yet what you want:
44 (1100)
4492 (1000)
33687 (100)
44848 (100)
44920 (100)
44921 (100)
44922 (100)
44923 (100)
44924 (100)
44925 (100)
44926 (100)
44927 (100)
44928 (100)
44929 (100)
336870 (10)
(large list of 10-groups)
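Before trimming this list further, here is a small sketch of why the string sort matters for the smart filter. The prefix list below is made up for illustration (it is not taken from the script's data). With a string sort, every prefix is followed directly by its own extensions, so comparing against $last is enough. With a numeric sort, the prefixes are ordered by value and that adjacency is lost.

```perl
use strict;
use warnings;

# Illustrative prefixes only; not part of the original script.
my @prefixes = qw(4 336 44 3 33 3368);

# String sort: each prefix is immediately followed by its extensions.
my @by_string = sort { $a cmp $b } @prefixes;

# Numeric sort: ordered by value, so 3 is followed by 4, not by 33.
my @by_number = sort { $a <=> $b } @prefixes;

print "string:  @by_string\n";    # string:  3 33 336 3368 4 44
print "numeric: @by_number\n";    # numeric: 3 4 33 44 336 3368
```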
To get rid of the 10-groups, you could change

# delete trivial sets
if ($count{$prefix} == 1) {
    delete $count{$prefix};
    next;
}

to

# delete trivial sets
if ($count{$prefix} <= 10) {
    delete $count{$prefix};
    next;
}
Output:
44 (1100)
4492 (1000)
33687 (100)
44848 (100)
44920 (100)
44921 (100)
44922 (100)
44923 (100)
44924 (100)
44925 (100)
44926 (100)
44927 (100)
44928 (100)
44929 (100)
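If you expect to try several thresholds, the cutoff could also be passed in as a parameter. The following is only a sketch: the sub names count_prefixes and filter_prefixes and the parameter $max_trivial are hypothetical and do not appear in the script above. It also uses a plain lexical $last instead of a state variable, because a state variable inside a sub would keep its value between calls.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not in the original script):
# count every leading substring of every number.
sub count_prefixes {
    my @numbers = @_;
    my %count;
    for my $number (@numbers) {
        $count{ substr($number, 0, $_) }++ for 1 .. length($number);
    }
    return %count;
}

# Hypothetical helper: the smart filter with a configurable cutoff.
# Prefixes with a count <= $max_trivial are treated as trivial sets.
sub filter_prefixes {
    my ($max_trivial, %count) = @_;
    my $last;    # plain lexical, so every call starts fresh

    for my $prefix (sort { $a cmp $b } keys %count) {
        # delete trivial super sets
        if (defined $last
            and index($prefix, $last) == 0
            and $count{$last} == $count{$prefix}) {
            delete $count{$last};
        }

        # delete trivial sets
        if ($count{$prefix} <= $max_trivial) {
            delete $count{$prefix};
            next;
        }

        # remember the last prefix
        $last = $prefix;
    }
    return %count;
}

my %all  = count_prefixes(4484800 .. 4484899, 3368700 .. 3368799, 4492000 .. 4492999);
my %kept = filter_prefixes(10, %all);

print "$_ ($kept{$_})\n"
    for sort { $kept{$b} <=> $kept{$a} or $a cmp $b } keys %kept;
```

Called with a cutoff of 10 on the same three number ranges, this should give the same 14 groups as listed above. A cutoff of 1 corresponds to the first version of the filter.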
This looks very good. Now it's up to you what to do with the 100-groups below 4492 (44920 to 44929) and the 1100-group 44. If you want to delete the 100-groups depending on their length, that could also delete the 4492 group in favor of the large 44 group.
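To make that trade-off concrete, here is a sketch of one possible rule, which is an assumption on my side and not necessarily the one you need: keep a prefix only if no shorter kept prefix already covers it. The %count hash is filled by hand with the 14 groups from the output above, and the name @outer is made up for this example.

```perl
use strict;
use warnings;

# The 14 groups from the filtered output above, entered by hand.
my %count = (44 => 1100, 4492 => 1000, 33687 => 100, 44848 => 100);
$count{"4492$_"} = 100 for 0 .. 9;

# Hypothetical rule: visit shorter prefixes first and keep a prefix
# only if none of the prefixes kept so far is a leading substring of it.
my @outer;
for my $prefix (sort { length($a) <=> length($b) or $a cmp $b } keys %count) {
    next if grep { index($prefix, $_) == 0 } @outer;
    push @outer, $prefix;
}

print "$_ ($count{$_})\n" for @outer;
# 44 (1100)
# 33687 (100)
```

With this rule, 4492, 44848 and all 4492x groups are swallowed by 44, which is exactly the effect described above. If 4492 should survive as its own group, the rule needs extra information that the counts alone do not provide.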