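I am trying to stem English text. I have read a lot of forum posts, but I cannot find a clear example. I want to use the Porter stemmer, as provided by Text::English. This is how far I have got: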

    use Lingua::StopWords qw(getStopWords);
    my $stopwords = getStopWords('en');
    use Text::English;

    @stopwords = grep { $stopwords->{$_} } (keys %$stopwords);

    chdir("c:/Test Facility/input");
    @files = <*>;

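    foreach $file (@files)
    {
        open (input, $file);
        # Open the output file once per input file, not once per line.
        open (output, ">>c:/Test Facility/normalized/".$file);

        while (<input>)
        {
            chomp;
            # Remove every stop word from the current line.
            for my $w (@stopwords)
            {
                s/\b\Q$w\E\b//ig;
            }
            $_ =~ s/<[^>]*>//g;        # strip HTML-like tags
            $_ =~ s/[[:punct:]]//g;    # strip punctuation
            ##What should I write here to apply porter stemming using Text::English##
            print output "$_\n";
        }

        # Close both handles before moving on to the next file.
        close (input);
        close (output);
    }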

1 Answer

Run the code below like this:

    perl stemmer.pl /usr/lib/jvm/java-6-sun-1.6.0.26/jre/LICENSE

It produces output similar to the following:

    operat system distributor licens java version sun microsystems inc sun willing to license java platform standard edition developer kit jdk

Note that, in addition to stop words, strings of length 1 and numeric values are also removed.

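    #!/usr/bin/env perl
    use common::sense;

    use Encode;
    use Lingua::Stem::Snowball;
    use Lingua::StopWords qw(getStopWords);
    use Scalar::Util qw(looks_like_number);

    my $stemmer = Lingua::Stem::Snowball->new(
        encoding    => 'UTF-8',
        lang        => 'en',
    );

    # Build a lookup hash keyed by lowercased stop word. Each key needs an
    # explicit value; a bare list of words would be paired up as key/value
    # and half of the stop words would be missed by the exists() test.
    my %stopwords = map {
        ( lc($_) => 1 )
    } keys %{getStopWords(en => 'UTF-8')};

    local $, = ' ';
    say map {
        sub {
            # Tokenize the line, then drop short tokens, numbers and stop words.
            my @w =
                map {
                    encode_utf8 $_
                } grep {
                    length >= 2
                    and not looks_like_number($_)
                    and not exists $stopwords{lc($_)}
                } split
                    /[\W_]+/x,
                    shift;

            # Stem the surviving tokens in place.
            $stemmer->stem_in_place(\@w);

            map {
                lc decode_utf8 $_
            } @w
        }->($_);
    } <>;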
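
The token filter in the script above can also be tried on its own, without the stemmer. The following is only a minimal sketch that uses core modules: the stop-word list is a small hypothetical stand-in for the one returned by Lingua::StopWords, and `filter_tokens` is an illustrative helper name that does not appear in the original script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical stop-word set, for illustration only; the real script
# takes its list from Lingua::StopWords.
my %stopwords = map { ( $_ => 1 ) } qw(the to of and is);

# Split a line on runs of non-word characters and underscores, then keep
# only the tokens that would be passed on to the stemmer.
sub filter_tokens {
    my ($line) = @_;
    return grep {
        length($_) >= 2
            and not looks_like_number($_)
            and not exists $stopwords{ lc $_ }
    } split /[\W_]+/, $line;
}

my @kept = filter_tokens('The JDK is a Developer_Kit of Java 2012');
print join(' ', @kept), "\n";    # prints "JDK Developer Kit Java"
```

Here "The", "is" and "of" are dropped as stop words, "a" is dropped for its length, and "2012" is dropped because it looks like a number. The underscore in "Developer_Kit" acts as a separator, so it yields two tokens.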
answered 2012-11-13T20:27:33.740