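I'm experimenting with sentiment classification for a web crawler I'm writing (don't ask me about reinventing the wheel, at least as far as the crawler goes!). I'm currently looking into Naive Bayes, mainly because there is a Perl module that makes it easier. However, I've run into problems setting up a test case that distinguishes good movie reviews from bad ones.
As I understand it, I first need to come up with a set of labelled data that the module will use to train itself. I went to a few movie review sites and downloaded a dozen or so bad reviews and a dozen or so good ones. I read each file in to generate a list of words, then turn that list into a hash of word frequencies. I then do the same for a bunch of "unknown" reviews (although I more or less know their sentiment), but I'm running into problems, and I'm not sure whether I'm going about this the wrong way!
Here is the test output: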
*** Processing good reviews
Reading set/good/anniehall.txt
Reading set/good/biglebowski.txt
Reading set/good/contact.txt
Reading set/good/eternalsunshine.txt
Reading set/good/harakiri.txt
Reading set/good/killing.txt
Reading set/good/lincoln.txt
Reading set/good/mulhollanddr.txt
Reading set/good/narayama.txt
Reading set/good/scarface.txt
Reading set/good/seven.txt
Reading set/good/shoah.txt
Reading set/good/spiritedaway.txt
*** Processing bad reviews
Reading set/bad/battlefieldearth.txt
Reading set/bad/charliesangels.txt
Reading set/bad/deathsmootchy.txt
Reading set/bad/deucebigaloweuro.txt
Reading set/bad/freddyfingered.txt
Reading set/bad/humancentipede.txt
Reading set/bad/jasonx.txt
Reading set/bad/north.txt
Reading set/bad/pootietang.txt
Reading set/bad/residentevilapocalypse.txt
Reading set/bad/savingsilverman.txt
Reading set/bad/slackers.txt
Reading set/bad/texaschainsaw.txt
*** Predicting unknown reviews
set/unknown/benjaminbutton.txt: $VAR1 = {
          'bad' => '1.06973342245912e-68',
          'good' => '1'
        };
set/unknown/epic.txt: $VAR1 = {
          'good' => '1',
          'bad' => '7.2271232924459e-35'
        };
set/unknown/hangoverpart3.txt: $VAR1 = {
          'good' => '1',
          'bad' => '1.08569835047604e-17'
        };
set/unknown/jacobsladder.txt: $VAR1 = {
          'good' => '1',
          'bad' => '9.31582505503138e-60'
        };
set/unknown/marleyme.txt: $VAR1 = {
          'good' => '1',
          'bad' => '5.57603799052706e-26'
        };
set/unknown/quantumofsolace.txt: $VAR1 = {
          'bad' => '2.40424666202666e-27',
          'good' => '1'
        };
set/unknown/thespirit.txt: $VAR1 = {
          'bad' => '2.47177895177767e-19',
          'good' => '1'
        };
set/unknown/twilight.txt: $VAR1 = {
          'good' => '1',
          'bad' => '9.77187340648713e-62'
        };
It seems to always label the unknown data as "good"!
Here is the program itself:
use 5.010;
use strict;
use warnings;
use utf8;
use Data::Dumper;
BEGIN { push @INC, "../lib"; }
use Algorithm::NaiveBayes;

my $nb = Algorithm::NaiveBayes->new;

# For each file in each directory, retrieve a hash with each key being a
# unique word, and each value the associated frequency of that word.

# Start with scanning good reviews directory
say "*** Processing good reviews";
my @files = <set/good/*>;
foreach (@files) {
    # glob's * already omits dotfiles (the old /^\./ test could never match
    # a path starting with "set/"), so just skip anything that is not a file.
    next unless -f $_;
    say "Reading $_";
    my %attr = hash_file($_);
    $nb->add_instance( attributes => \%attr, label => 'good' );
}

# Then scan bad reviews
say "*** Processing bad reviews";
@files = <set/bad/*>;
foreach (@files) {
    next unless -f $_;    # skip anything that is not a plain file
    say "Reading $_";
    my %attr = hash_file($_);
    $nb->add_instance( attributes => \%attr, label => 'bad' );
}

# Train, and cross fingers
$nb->train;

# Test unknown reviews
say "*** Predicting unknown reviews";
@files = <set/unknown/*>;
foreach (@files) {
    next unless -f $_;    # skip anything that is not a plain file
    print "$_: ";
    my %attr = hash_file($_);
    my $result = $nb->predict( attributes => \%attr );
    print Dumper($result);
}

# Subroutine that takes a file path and returns a hash of the word frequencies
sub hash_file {
    my ($file) = @_;
    my %words;

    # Three-argument open with a lexical filehandle
    open my $fh, '<', $file or die "Cannot open $file: $!";
    while ( my $line = <$fh> ) {
        # split ' ' splits on runs of whitespace, so no chomp is needed
        foreach my $word ( split ' ', $line ) {
            $word =~ s/[[:punct:]]//g;    # Remove punctuation
            next if $word eq '';
            $words{$word}++;              # starts at 1 the first time a word is seen
        }
    }
    close $fh;

    return %words;
}
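For reference, here is a minimal sketch of how the word-counting step can be checked in isolation, without the review files or the Bayes module. `count_words` is a hypothetical helper that is not part of the program above; it assumes the same split-on-whitespace and strip-punctuation logic as `hash_file`, applied to a string instead of a file. The sample sentence is made up purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original program): same counting logic
# as hash_file, but it takes a string so it can be checked without files.
sub count_words {
    my ($text) = @_;
    my %words;
    foreach my $word ( split ' ', $text ) {
        $word =~ s/[[:punct:]]//g;    # remove punctuation
        next if $word eq '';
        $words{$word}++;              # count each distinct word
    }
    return %words;
}

# Made-up sample text, for illustration only.
my %freq = count_words("Good film. A good, good film!");

# Print the words in sorted order so the result is easy to inspect.
printf "%-6s %d\n", $_, $freq{$_} for sort keys %freq;
```

One thing this makes visible is that the words are not lowercased, so "Good" and "good" end up as separate attributes. Whether that matters for the classification result is not something this sketch can show.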
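I was hoping there was some obvious mistake in that code, but I've checked the hashing subroutine and it seems to be spitting out the correct hashes. The only other thing I can think of is that maybe I'm not using enough data to train it? Or maybe my whole approach is misguided?
Thanks for any insight.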