perl - Removing stop words from multiple files in Perl

Question

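I have read many posts about how to remove stop words from files. My code already removes many other things, but I also want it to remove stop words. This is as far as I have gotten, and I don't know what I am missing. Please advise.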

use Lingua::StopWords qw(getStopWords);
my $stopwords = getStopWords('en');

chdir("c:/perl/input");
@files = <*>;

foreach $file (@files) 
  {
    open (input, $file);

    while (<input>) 
      {
        open (output,">>c:/perl/normalized/".$file);
    chomp;
    #####What should I write here to remove the stop words#####
    $_ =~s/<[^>]*>//g;
    $_ =~ s/\s\.//g;
    $_ =~ s/[[:punct:]]\.//g;
    if($_ =~ m/(\w{4,})\./)
    {
    $_ =~ s/\.//g;
    }
    $_ =~ s/^\.//g;
    $_ =~ s/,/' '/g;
    $_ =~ s/\(||\)||\\||\/||-||\'//g;

    print output "$_\n";

      }
   }

close (input);
close (output);

score 2 · Accepted Answer

The stop words are the keys of %$stopwords whose value is 1, i.e.:

@stopwords = grep { $stopwords->{$_} } (keys %$stopwords);

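The stop words may well be simply the keys of %$stopwords, but according to the Lingua::StopWords documentation you also need to check the value associated with each key.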

Once you have the stop words, you can remove them with code like this:

# remove all occurrences of @stopwords from $_

for my $w (@stopwords) {
  s/\b\Q$w\E\b//ig;
}

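Note the use of \Q...\E to quote any regex metacharacters that might appear in a stop word. Although stop words are unlikely to contain metacharacters, this is good practice whenever you want to match a literal string inside a regex.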
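
To make the effect of \Q...\E concrete, here is a minimal sketch. The token 'a.b' and the sample text are hypothetical and chosen only because the dot is a regex metacharacter; they are not taken from the Lingua::StopWords list.

```perl
use strict;
use warnings;

# Hypothetical token containing a regex metacharacter, for illustration only.
my $word = 'a.b';
my $text = 'axb a.b';

# Without \Q...\E the dot matches any character, so "axb" is removed as well.
(my $unquoted = $text) =~ s/$word//g;

# With \Q...\E the dot is taken literally, so only "a.b" is removed.
(my $quoted = $text) =~ s/\Q$word\E//g;

print "unquoted: [$unquoted]\n";
print "quoted:   [$quoted]\n";
```

The unquoted substitution also deletes "axb", which is the kind of accidental match that quoting prevents.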

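We also use \b to match word boundaries. This helps ensure that we don't match a stop word in the middle of another word. Hopefully this works for you. It depends a great deal on what your input text looks like, i.e. whether you have punctuation, etc.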
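
The loop above can be tried on a single line as follows. This is only a sketch: the three-word hash is a hypothetical stand-in for the hash ref returned by getStopWords('en') so that it runs without the module installed, and the helper name remove_stop_words and the whitespace tidying at the end are additions for illustration, not part of the approach described above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the hash ref returned by getStopWords('en').
my $stopwords = { the => 1, is => 1, on => 1 };
my @stopwords = grep { $stopwords->{$_} } keys %$stopwords;

# Remove every whole-word occurrence of a stop word, then tidy whitespace.
sub remove_stop_words {
    my ($line, @words) = @_;
    for my $w (@words) {
        $line =~ s/\b\Q$w\E\b//ig;
    }
    $line =~ s/\s+/ /g;         # collapse the gaps left by removed words
    $line =~ s/^\s+|\s+$//g;    # trim both ends
    return $line;
}

print remove_stop_words('The cat is on this onion', @stopwords), "\n";
```

Because of \b, the "is" inside "this" and the "on" inside "onion" are left alone, while the standalone words are removed.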

score 0 · Accepted Answer

# Always use these in your Perl programs.
use strict;
use warnings;

use File::Basename qw(basename);
use Lingua::StopWords qw(getStopWords);

# It's often better to build scripts that take their input
# and output locations as command-line arguments rather than
# being hard-coded in the program.
my $input_dir   = shift @ARGV;
my $output_dir  = shift @ARGV;
my @input_files = glob "$input_dir/*";

# Convert the hash ref of stop words to a regular array.
# Also quote any regex characters in the stop words.
my @stop_words  = map quotemeta, keys %{getStopWords('en')};

for my $infile (@input_files){
    # Open both input and output files at the outset.
    # Your posted code reopened the output file for each line of input.
    my $fname   = basename $infile;
    my $outfile = "$output_dir/$fname";
    open(my $fh_in,  '<', $infile)  or die "$!: $infile";
    open(my $fh_out, '>', $outfile) or die "$!: $outfile";

    # Process the data: you need to iterate over all stop words
    # for each line of input.
    while (my $line = <$fh_in>){
        $line =~ s/\b$_\b//ig for @stop_words;
        print $fh_out $line;
    }

    # Close the files within the processing loop, not outside of it.
    close $fh_in;
    close $fh_out;
}
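
A possible variation on the script above, not something the answer itself does, is to join the stop words into one alternation and compile it once with qr//, so that each line is scanned a single time instead of once per stop word. The sketch below assumes a small hypothetical word list in place of keys %{getStopWords('en')}; the names $stop_re and strip_stop_words are likewise made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical stop-word list; in the script above this would come from
# keys %{ getStopWords('en') }.
my @stop_words = ('the', 'is', "isn't", 'on');

# Sort longest first so a longer word is tried before a shorter one that
# is its prefix, then join into one alternation and compile it once.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @stop_words;
my $stop_re = qr/\b(?:$alternation)\b/i;

# Remove all stop words from one line with a single substitution.
sub strip_stop_words {
    my ($line) = @_;
    $line =~ s/$stop_re//g;
    return $line;
}

print strip_stop_words("The cat isn't on the mat\n");
```

Inside the while loop, the per-word substitution could then be replaced by a single call such as $line = strip_stop_words($line). Like the original, this leaves the surrounding spaces in place.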
