1

我是 Perl 的新手,正在尝试用 Perl 编写Apriori 算法。最初我创建了一个哈希表来获取每个项目的频率,但是如何创建带有包含所有项目对的键的哈希表?我的意思是如何找到频繁项集?

这是我到目前为止写的代码。

open(IN,"dataset-1.txt");   
my %words;
while (my $line = <IN>)
{
   foreach my $word (split /\s+/, $line)
   {
        $words{$word}++;
   }
}
foreach my $word (keys %words)
{
    print "$word: $words{$word}\n";
}

我使用的数据集如下:

mango onion nintendo chain eggs
doll onion nintendo
mango apple chain
mango umbrella chain
corn chain cream eggs 

我要做的第二步是创建一个哈希表,其中所有可能的项目组合作为键和值应该是每行中项目集的频率。你能帮我吗?

4

1 回答 1

2

要回答有关如何获取所有可能的项目组合的问题,您可以使用Algorithm::Combinatorics模块。

#!/usr/bin/perl
use warnings;
use strict;
use Algorithm::Combinatorics qw(subsets);

my %freq; #save the frequencies
open (my $IN,'<','dataset-1.txt') or die $!;
while (<$IN>) {
    chomp;
    my @words=sort split/\s+/;
    my @itemsets=subsets(\@words); #we take every subset of the items of each row
    for (@itemsets) {
        $freq{join(',',@$_)}++;
    }
}
close $IN;

for (sort {$freq{$b}<=>$freq{$a}} keys %freq) { #print the frequency
    next if $_ eq ''; #skip empty itemset if present
    print "$_=>$freq{$_}\n";
}

对于您的数据集,这将打印:

chain=>4
mango=>3
chain,mango=>3
onion=>2
chain,eggs=>2
eggs=>2
nintendo,onion=>2
nintendo=>2
chain,eggs,mango,onion=>1
chain,eggs,mango=>1
eggs,nintendo=>1
cream,eggs=>1
chain,corn=>1
doll,onion=>1
chain,umbrella=>1
eggs,nintendo,onion=>1
chain,cream=>1
eggs,mango=>1
eggs,mango,onion=>1
chain,corn,eggs=>1
corn,cream,eggs=>1
umbrella=>1
mango,nintendo,onion=>1
chain,eggs,nintendo=>1
mango,nintendo=>1
chain,cream,eggs=>1
apple,mango=>1
chain,nintendo,onion=>1
chain,corn,cream,eggs=>1
doll,nintendo,onion=>1
chain,eggs,nintendo,onion=>1
chain,mango,nintendo=>1
chain,eggs,mango,nintendo=>1
corn=>1
cream=>1
doll=>1
chain,eggs,mango,nintendo,onion=>1
chain,mango,onion=>1
chain,onion=>1
mango,onion=>1
chain,corn,cream=>1
corn,cream=>1
doll,nintendo=>1
apple=>1
chain,mango,nintendo,onion=>1
apple,chain=>1
apple,chain,mango=>1
eggs,mango,nintendo,onion=>1
eggs,mango,nintendo=>1
corn,eggs=>1
mango,umbrella=>1
chain,mango,umbrella=>1
eggs,onion=>1
chain,nintendo=>1
chain,eggs,onion=>1

请注意,这不是Apriori 算法,它只是您问题的答案。但是知道如何进行组合应该很容易从那里得到它,定义一个支持阈值,清除频率较低的项集,然后继续。

现在,由于 Apriori 用于派生关联规则,因此您可以跳过所有痛苦并使用Tree::FP模块(它实现了频繁模式树)。我今天尝试了它,不幸的是发现它有问题(取决于数据集和定义的支持阈值可能会引发错误),但以下代码仍然有效并返回规则:

#!/usr/bin/perl
use warnings;
use strict;
use Tree::FP;
use POSIX qw(ceil);

my %words;
my $min_sup=20; #support is defined to find "large" itemsets. just an example, choose your own values!
my $min_conf=60; #confidence is defined to find "strong" associations. just an example, choose your own values!
{
    open (my $IN,'<','dataset-1.txt') or die $!;
    while (<$IN>) {
        chomp;
        $words{$_}++ for split/\s+/;
    }
    close $IN;
}
my @sorted=sort {$words{$b}<=>$words{$a}} keys %words; #we prune itemsets with frequencies inferior to the minimum support and sort them by frequency
my $fptree=Tree::FP->new(@sorted); #create a new Frequent-Pattern Tree
$fptree->set_support($min_sup/100); #set support
$fptree->set_confidence($min_conf/100); #set confidence (note: it actually doesn't filter the results as expected; known issue of the module)
{ #populate the tree
    open (my $IN,'<','dataset-1.txt') or die $!;
    while (<$IN>) {
        chomp;
        $fptree->insert_tree(split/\s+/) or die "Error while insert row $.: ",$fptree->err;
    }
    close $IN;
}
my @rules=$fptree->association_rules;
for (@rules) {
    next if $_->confidence < $fptree->confidence;
    print '{',join(',',$_->left),'} => {',join(',',$_->right),'} ',sprintf('support:%.2f, confidence:%.2f',$_->support,$_->confidence),"\n";
}
于 2013-10-21T02:46:43.687 回答