要回答有关如何获取所有可能的项目组合的问题,您可以使用Algorithm::Combinatorics模块。
#!/usr/bin/perl
use warnings;
use strict;
use Algorithm::Combinatorics qw(subsets);
my %freq; #save the frequencies
open (my $IN,'<','dataset-1.txt') or die $!;
while (<$IN>) {
chomp;
my @words=sort split/\s+/;
my @itemsets=subsets(\@words); #we take every subset of the items of each row
for (@itemsets) {
$freq{join(',',@$_)}++;
}
}
close $IN;
for (sort {$freq{$b}<=>$freq{$a}} keys %freq) { #print the frequency
next if $_ eq ''; #skip empty itemset if present
print "$_=>$freq{$_}\n";
}
对于您的数据集,这将打印:
chain=>4
mango=>3
chain,mango=>3
onion=>2
chain,eggs=>2
eggs=>2
nintendo,onion=>2
nintendo=>2
chain,eggs,mango,onion=>1
chain,eggs,mango=>1
eggs,nintendo=>1
cream,eggs=>1
chain,corn=>1
doll,onion=>1
chain,umbrella=>1
eggs,nintendo,onion=>1
chain,cream=>1
eggs,mango=>1
eggs,mango,onion=>1
chain,corn,eggs=>1
corn,cream,eggs=>1
umbrella=>1
mango,nintendo,onion=>1
chain,eggs,nintendo=>1
mango,nintendo=>1
chain,cream,eggs=>1
apple,mango=>1
chain,nintendo,onion=>1
chain,corn,cream,eggs=>1
doll,nintendo,onion=>1
chain,eggs,nintendo,onion=>1
chain,mango,nintendo=>1
chain,eggs,mango,nintendo=>1
corn=>1
cream=>1
doll=>1
chain,eggs,mango,nintendo,onion=>1
chain,mango,onion=>1
chain,onion=>1
mango,onion=>1
chain,corn,cream=>1
corn,cream=>1
doll,nintendo=>1
apple=>1
chain,mango,nintendo,onion=>1
apple,chain=>1
apple,chain,mango=>1
eggs,mango,nintendo,onion=>1
eggs,mango,nintendo=>1
corn,eggs=>1
mango,umbrella=>1
chain,mango,umbrella=>1
eggs,onion=>1
chain,nintendo=>1
chain,eggs,onion=>1
请注意,这不是Apriori 算法,它只是您问题的答案。但是知道如何进行组合应该很容易从那里得到它,定义一个支持阈值,清除频率较低的项集,然后继续。
现在,由于 Apriori 用于派生关联规则,因此您可以跳过所有痛苦并使用Tree::FP模块(它实现了频繁模式树)。我今天尝试了它,不幸的是发现它有问题(取决于数据集和定义的支持阈值可能会引发错误),但以下代码仍然有效并返回规则:
#!/usr/bin/perl
use warnings;
use strict;
use Tree::FP;
use POSIX qw(ceil);
my %words;
my $min_sup=20; #support is defined to find "large" itemsets. just an example, choose your own values!
my $min_conf=60; #confidence is defined to find "strong" associations. just an example, choose your own values!
{
open (my $IN,'<','dataset-1.txt') or die $!;
while (<$IN>) {
chomp;
$words{$_}++ for split/\s+/;
}
close $IN;
}
my @sorted=sort {$words{$b}<=>$words{$a}} keys %words; #we prune itemsets with frequencies inferior to the minimum support and sort them by frequency
my $fptree=Tree::FP->new(@sorted); #create a new Frequent-Pattern Tree
$fptree->set_support($min_sup/100); #set support
$fptree->set_confidence($min_conf/100); #set confidence (note: it actually doesn't filter the results as expected; known issue of the module)
{ #populate the tree
open (my $IN,'<','dataset-1.txt') or die $!;
while (<$IN>) {
chomp;
$fptree->insert_tree(split/\s+/) or die "Error while insert row $.: ",$fptree->err;
}
close $IN;
}
my @rules=$fptree->association_rules;
for (@rules) {
next if $_->confidence < $fptree->confidence;
print '{',join(',',$_->left),'} => {',join(',',$_->right),'} ',sprintf('support:%.2f, confidence:%.2f',$_->support,$_->confidence),"\n";
}