perl - How do I remove stop words from a large text file?

Question

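I have a billion-word corpus that I have collected in a scalar. I have a .regex file that contains all the stop words I want to remove from my data (text).

I don't know how to use this .regex file, so I made an array and stored all the stop words from the .regex file in it.

To remove the stop words, I do the following: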

grep { $scalarText =~ s/\b\Q$_\E\b/ /g } @stopList;

This takes a very long time to run. How can I use the .regex file in my Perl script to remove the stop words? Or is there a faster way to remove stop words?

score 5 · Accepted Answer

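Yes, I imagine that what you're doing there is extremely slow, and for more than one reason. I think you need to process your stopwords regex before you build up your string of a billion words from your corpus.

I have no idea what a .regex file is, but I'm going to presume it contains a legal Perl regular expression, one that you can compile using nothing more than: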

$stopword_string = `cat foo.regex`;
$stopword_rx     = qr/$stopword_string/;

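That probably presumes there's a (?x) at the start of it.

However, if your stopword file is a list of lines, you will need to do something more like this: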

chomp(@stopwords = `cat foo.regex`);

# if each stopword is an independent regex:
$stopword_string = join "|" => @stopwords;

# else if each stopword is a literal
$stopword_string = join "|" => map {quotemeta} @stopwords;

# now compile it (maybe add some qr//OPTS)
$stopword_rx     = qr/\b(?:$stopword_string)\b/;
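To make the literal-stopword branch concrete, here is a minimal sketch that runs on its own. The word list, the sample sentence, and the /i flag are assumptions made up for illustration; in real use the list would come from your .regex file.

```perl
use strict;
use warnings;

# Hypothetical stopword list standing in for the lines of the .regex file.
my @stopwords = ("the", "a", "of", "and");

# Treat every entry as a literal: quotemeta escapes regex metacharacters.
my $stopword_string = join "|", map { quotemeta } @stopwords;

# Compile once, up front. The /i flag is an optional extra for this demo.
my $stopword_rx = qr/\b(?:$stopword_string)\b/i;

my $text = "The history of the cat and a dog";
(my $cleaned = $text) =~ s/$stopword_rx//g;

# Tidy the whitespace left behind by the removals.
$cleaned =~ s/\s+/ /g;
$cleaned =~ s/^\s+|\s+$//g;

print "$cleaned\n";   # prints "history cat dog"
```

The point of the sketch is that the regex is built and compiled exactly once, so the text is scanned a single time rather than once per stopword.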

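Warning

Be very careful with \b: it will only do what you think it does above if every stopword begins and ends with an alphanumunder (a \w character). Otherwise, it will be asserting something you probably didn't mean. If that is a possibility, you will need to be more specific. The leading \b would need to become (?:(?<=\A)|(?<=\s)), and the trailing \b would need to become (?=\s|\z). That's what most people think \b means, but it really doesn't.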
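A small sketch of the difference, using a made-up stopword that ends in a non-word character ("c++"). The word list and sample text are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical stopwords; "c++" ends in a character that is not \w.
my @stopwords   = ("c++", "the");
my $alternation = join "|", map { quotemeta } @stopwords;

# Version 1: plain \b on both sides.
my $rx_wordb = qr/\b(?:$alternation)\b/;

# Version 2: the whitespace-or-edge lookarounds from the warning.
my $rx_strict = qr/(?:(?<=\A)|(?<=\s))(?:$alternation)(?=\s|\z)/;

my $text = "learn c++ the hard way";

(my $with_b    = $text) =~ s/$rx_wordb//g;
(my $with_look = $text) =~ s/$rx_strict//g;

# With \b, "c++" survives: there is no word boundary between "+" and " ".
print "with \\b:         [$with_b]\n";
# With the lookarounds, both "c++" and "the" are removed.
print "with lookaround: [$with_look]\n";
```

If all of your stopwords are plain words, the two versions should behave the same for whitespace-separated text, and \b is the simpler choice.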

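Having done that, you should apply the stopword regex to the corpus as you read it in. The best way to do this is not to put stuff into your string in the first place that you'll only need to take out later.

So instead of doing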

$corpus_text = `cat some-giant-file`;
$corpus_text =~ s/$stopword_rx//g;

do this instead

my $corpus_path = "/some/path/goes/here";
open(my $corpus_fh, "< :encoding(UTF-8)", $corpus_path)
    || die "$0: couldn't open $corpus_path: $!";

my $corpus_text = q##;

while (<$corpus_fh>) {
    chomp;  # or not
    # note: this skips the entire line whenever it contains a stopword
    $corpus_text .= $_ unless /$stopword_rx/;
}

close($corpus_fh)
    || die "$0: couldn't close $corpus_path: $!";

This is much faster than putting stuff in there that you'll only have to weed out again later.
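Note that the loop above drops any line that matches the regex, which suits a corpus with one word per line. If your lines hold several words each, a per-line substitution may be closer to what you want. A minimal sketch, assuming a tiny hard-coded regex and an in-memory filehandle standing in for the real corpus file (both are placeholders, not part of the original answer):

```perl
use strict;
use warnings;

# Hypothetical precompiled stopword regex; build yours from the stopword file.
my $stopword_rx = qr/\b(?:the|of|and)\b/;

# In-memory filehandle standing in for the real corpus file.
my $sample = "the cat and the hat\nking of hearts\n";
open(my $corpus_fh, "<", \$sample)
    || die "$0: couldn't open in-memory sample: $!";

my $corpus_text = q##;

while (my $line = <$corpus_fh>) {
    $line =~ s/$stopword_rx//g;   # strip stopwords from this line only
    $corpus_text .= $line;        # append what is left
}

close($corpus_fh)
    || die "$0: couldn't close in-memory sample: $!";

print $corpus_text;
```

Each substitution only ever scans one line, so the giant string never contains stopwords that have to be removed afterwards.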

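My use of cat above is just a shortcut. I don't expect you to actually call out to a program, least of all cat, just to read in a single file, unprocessed and unmolested. ☺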
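For reference, reading the lines without shelling out could look something like this. The read_lines helper is a hypothetical name, and the temporary file exists only so that the sketch runs on its own; you would pass the path of your real stopword file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: returns the chomped lines of a UTF-8 text file.
sub read_lines {
    my ($path) = @_;
    open(my $fh, "< :encoding(UTF-8)", $path)
        || die "$0: couldn't open $path: $!";
    chomp(my @lines = <$fh>);
    close($fh)
        || die "$0: couldn't close $path: $!";
    return @lines;
}

# Demo only: write a throwaway stopword file, then read it back.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} "the\nof\nand\n";
close($tmp_fh)
    || die "$0: couldn't close $tmp_path: $!";

my @stopwords = read_lines($tmp_path);
print join(",", @stopwords), "\n";   # prints "the,of,and"
```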

score 2 · Accepted Answer


You may want to use Regexp::Assemble to compile a list of Perl regular expressions into a single regex.
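A rough sketch of how that might look, assuming Regexp::Assemble is installed from CPAN (it is not a core module) and that the stopwords are literals. The word list and sample text are made up for illustration.

```perl
use strict;
use warnings;
use Regexp::Assemble;   # CPAN module, not part of core Perl

# Hypothetical stopword list with shared prefixes.
my @stopwords = ("the", "then", "there", "of");

my $ra = Regexp::Assemble->new;
$ra->add(quotemeta $_) for @stopwords;   # add each word as a literal pattern

# re() returns one regex covering all added patterns, with common
# prefixes merged instead of a flat alternation.
my $assembled   = $ra->re;
my $stopword_rx = qr/\b(?:$assembled)\b/;

(my $cleaned = "then there was the end of it") =~ s/$stopword_rx//g;
print "$cleaned\n";
```

Whether this is faster than a plain joined alternation for your list is something to measure rather than assume.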

Answered 2010-10-31T21:39:58.527

score 0 · Accepted Answer

I found a faster way. It saves me about 4 seconds.

my $qrstring = '\b(' . (join '|', @stopList) . ')\b';
$scalarText =~ s/$qrstring/ /g;

where stopList is the array of all my words and scalarText is my entire text.

Can anyone tell me a faster way, if you know one?
