I'm experimenting with sentiment classification for a web crawler I'm writing (don't ask me about reinventing the wheel, at least as far as the crawler is concerned!). I'm currently looking at naive Bayes, mostly because there's a Perl module that makes it easy to get started. However, I'm having trouble setting up a test case that classifies good/bad movie reviews.

As I understand it, I first need to put together a set of training data for the module to learn from. I went to a couple of movie review sites and downloaded a dozen or so bad reviews and a dozen or so good ones. I read each file in to produce a list of words, which I then turn into a hash of word frequencies. I then do the same with a bunch of "unknown" reviews (although I more or less know their sentiment), but I'm running into problems and I'm not sure whether I'm going about this the wrong way!
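
For context, the Algorithm::NaiveBayes workflow I'm following boils down to the calls below (just a minimal sketch with made-up word counts to show the shape of the data, not my actual script, which is further down):

use strict;
use warnings;
use Algorithm::NaiveBayes;

my $nb = Algorithm::NaiveBayes->new;

# Each training instance is a hash of word => count plus a label;
# these tiny hand-made hashes just stand in for real review text.
$nb->add_instance( attributes => { great => 2, loved => 1 },  label => 'good' );
$nb->add_instance( attributes => { awful => 2, boring => 1 }, label => 'bad' );

$nb->train;

# predict() returns a hashref mapping each label to a score.
my $result = $nb->predict( attributes => { great => 1, boring => 1 } );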

Here is the test output:

*** Processing good reviews
Reading set/good/anniehall.txt
Reading set/good/biglebowski.txt
Reading set/good/contact.txt
Reading set/good/eternalsunshine.txt
Reading set/good/harakiri.txt
Reading set/good/killing.txt
Reading set/good/lincoln.txt
Reading set/good/mulhollanddr.txt
Reading set/good/narayama.txt
Reading set/good/scarface.txt
Reading set/good/seven.txt
Reading set/good/shoah.txt
Reading set/good/spiritedaway.txt
*** Processing bad reviews
Reading set/bad/battlefieldearth.txt
Reading set/bad/charliesangels.txt
Reading set/bad/deathsmootchy.txt
Reading set/bad/deucebigaloweuro.txt
Reading set/bad/freddyfingered.txt
Reading set/bad/humancentipede.txt
Reading set/bad/jasonx.txt
Reading set/bad/north.txt
Reading set/bad/pootietang.txt
Reading set/bad/residentevilapocalypse.txt
Reading set/bad/savingsilverman.txt
Reading set/bad/slackers.txt
Reading set/bad/texaschainsaw.txt
*** Predicting unknown reviews
set/unknown/benjaminbutton.txt: $VAR1 = {
          'bad' => '1.06973342245912e-68',
          'good' => '1'
        };
set/unknown/epic.txt: $VAR1 = {
          'good' => '1',
          'bad' => '7.2271232924459e-35'
        };
set/unknown/hangoverpart3.txt: $VAR1 = {
          'good' => '1',
          'bad' => '1.08569835047604e-17'
        };
set/unknown/jacobsladder.txt: $VAR1 = {
          'good' => '1',
          'bad' => '9.31582505503138e-60'
        };
set/unknown/marleyme.txt: $VAR1 = {
          'good' => '1',
          'bad' => '5.57603799052706e-26'
        };
set/unknown/quantumofsolace.txt: $VAR1 = {
          'bad' => '2.40424666202666e-27',
          'good' => '1'
        };
set/unknown/thespirit.txt: $VAR1 = {
          'bad' => '2.47177895177767e-19',
          'good' => '1'
        };
set/unknown/twilight.txt: $VAR1 = {
          'good' => '1',
          'bad' => '9.77187340648713e-62'
        };

It seems to always label the unknown reviews as 'good'!

Here is the program itself:

use 5.010;
use strict;
use warnings;
use utf8;
use Data::Dumper;

BEGIN { push @INC, "../lib"; }
use Algorithm::NaiveBayes;

my $nb = Algorithm::NaiveBayes->new;

# For each file in each directory, retrieve a hash with each key being a
# unique word, and each value the associated frequency of that word.

# Start with scanning good reviews directory
say "*** Processing good reviews";
my @files = <set/good/*>;
foreach (@files) {
    next if $_ =~ m{/\.}; # ignore hidden files ($_ is a full path, e.g. set/good/foo.txt)
    say "Reading $_";
    my %attr = hash_file($_);
    $nb->add_instance ( attributes => \%attr, label => 'good');
}

# Then scan bad reviews
say "*** Processing bad reviews";
@files = <set/bad/*>;
foreach (@files) {
    next if $_ =~ m{/\.}; # ignore hidden files ($_ is a full path, e.g. set/bad/foo.txt)
    say "Reading $_";
    my %attr = hash_file($_);
    $nb->add_instance ( attributes => \%attr, label => 'bad');
}

# Train, and cross fingers
$nb->train;

# Test unknown reviews
say "*** Predicting unknown reviews";
@files = <set/unknown/*>;
foreach (@files) {
    next if $_ =~ m{/\.}; # ignore hidden files ($_ is a full path, e.g. set/unknown/foo.txt)
    print "$_: ";
    my %attr = hash_file($_);
    my $result = $nb->predict(attributes => \%attr);
    print Dumper($result);
}


# Subroutine that takes a file path and returns a hash of the word frequencies
sub hash_file {
    my ($file) = @_;
    my %words;

    my @word_list;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    while (<$fh>) {
        chomp;
        push @word_list, split;
    }
    close $fh;

    foreach (@word_list){
        $_ =~ s/[[:punct:]]//g; # Remove punctuation
        next if ($_ eq '');

        # Increment frequency if word is in hash, or add to hash
        if (exists $words{$_} ){
            $words{$_}++;
        } else {
            $words{$_} = 1;
        }

    }
    return %words;
}

I was hoping there would be some obvious mistake in that code, but I've checked the hashing subroutine and it seems to spit out the right hashes. The only other thing I can think of is that maybe I'm not training it with enough data? Or maybe my whole approach is misguided?
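
(For reference, a quick way to eyeball what hash_file produces for one of the training files is just to dump the hash, e.g.:)

use Data::Dumper;

# Quick sanity check: dump the word-frequency hash for one training file
# and eyeball the counts.
my %check = hash_file('set/good/anniehall.txt');
print Dumper(\%check);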

Thanks for any insight.
