perl - Perl paragraph n-gram

Question

Suppose I have a piece of text:

$body = 'the quick brown fox jumps over the lazy dog';

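I want to turn that sentence into a hash of "keywords", but I want to allow multi-word keywords. I have the following to get the single-word keywords: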

$words{$_}++ for $body =~ m/(\w+)/g;

When that finishes, I have a hash that looks like this:

'the' => 2,
'quick' => 1,
'brown' => 1,
'fox' => 1,
'jumps' => 1,
'over' => 1,
'lazy' => 1,
'dog' => 1

The next step, so that I can get 2-word keywords, is the following:

$words{$_}++ for $body =~ m/(\w+ \w+)/g;

But that only gets every other pair, which looks like this:

'the quick' => 1,
'brown fox' => 1,
'jumps over' => 1,
'the lazy' => 1

I also need the pairs offset by one word:

'quick brown' => 1,
'fox jumps' => 1,
'over the' => 1

Is there an easier way to do this than the following?

my $orig_body = $body;
# single word keywords
$words{$_}++ for $body =~ m/(\w+)/g;
# double word keywords
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body = $orig_body;
# triple word keywords
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body = $orig_body;
$body =~ s/^(\w+ \w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;

score 5 · Accepted Answer

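While the described task might be interesting to code by hand, would it not be better to use an existing CPAN module that handles n-grams? It looks like Text::Ngrams (as opposed to Text::Ngram) can handle word-based n-gram analysis.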

score 3 · Accepted Answer

You can do something funky with lookahead:

If I do:

$words{$_}++ for $body =~ m/(?=(\w+ \w+))\w+/g;

The expression says to look ahead two words (and capture them), but to consume only one.

I get:

%words: {
          'brown fox' => 1,
          'fox jumps' => 1,
          'jumps over' => 1,
          'lazy dog' => 1,
          'over the' => 1,
          'quick brown' => 1,
          'the lazy' => 1,
          'the quick' => 1
        }
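To see why the pairs now overlap, it may help to print where each match starts. The following is only a sketch: `pair_offsets` is a hypothetical helper name, and it assumes words separated by single spaces, as in the example sentence.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [offset, pair] for every overlapping word pair.
sub pair_offsets {
    my ($text) = @_;
    my @found;
    # The lookahead captures two words without consuming them;
    # the trailing \w+ then consumes only the first word.
    while ($text =~ m/(?=(\w+ \w+))\w+/g) {
        push @found, [ $-[0], $1 ];    # $-[0] is where this match began
    }
    return @found;
}

printf "%2d  %s\n", @$_ for pair_offsets('the quick brown fox');
```

Each match begins at the start of a word, and the next search resumes right after that single word, so the second word of one pair is still available as the first word of the next.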

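It looks like I can generalize this by putting the count in a variable (note that {$n} repeats the " \w+" group, so the pattern captures $n + 1 words):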

my $n    = 4;
$words{$_}++ for $body =~ m/(?=(\w+(?: \w+){$n}))\w+/g;
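One possible way to package this is a small sub that takes the phrase length directly. This is a sketch, not the answer's own code: `count_ngrams` and `$extra` are hypothetical names, and here `$n` means the total number of words per phrase, so the quantifier is `$n - 1`.

```perl
use strict;
use warnings;

# Hypothetical helper: counts every run of exactly $n consecutive words.
# Assumes words are separated by single spaces.
sub count_ngrams {
    my ($text, $n) = @_;
    my %count;
    my $extra = $n - 1;    # words that follow the first one
    # In list context, m//g returns every capture of the lookahead group.
    $count{$_}++ for $text =~ m/(?=(\w+(?: \w+){$extra}))\w+/g;
    return \%count;
}

my $body     = 'the quick brown fox jumps over the lazy dog';
my $trigrams = count_ngrams($body, 3);
print "'$_' => $trigrams->{$_}\n" for sort keys %$trigrams;
```

Calling it once per length (1, 2, 3, ...) and merging the results would give the combined keyword hash the question asks for.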

score 2 · Accepted Answer

I would use lookahead to collect everything except the first word. That way, the position automatically advances correctly:

my $body = 'the quick brown fox jumps over the lazy dog';

my %words;

++$words{$1}         while $body =~ m/(\w+)/g;
++$words{"$1 $2"}    while $body =~ m/(\w+) \s+ (?= (\w+) )/gx;
++$words{"$1 $2 $3"} while $body =~ m/(\w+) \s+ (?= (\w+) \s+ (\w+) )/gx;

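If you want to stick with single spaces instead of \s+ (don't forget to remove the /x modifier if you do), you can simplify it slightly, because you can collect any number of words in $2 instead of using one group per word.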
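The single-space simplification might look roughly like the following sketch. The sub name `count_up_to_three` is hypothetical, and the code assumes every word is separated by exactly one space.

```perl
use strict;
use warnings;

# Hypothetical helper: counts 1-, 2- and 3-word phrases.
# With literal single spaces, $2 can hold all trailing words at once.
sub count_up_to_three {
    my ($body) = @_;
    my %words;
    ++$words{$1}      while $body =~ m/(\w+)/g;
    ++$words{"$1 $2"} while $body =~ m/(\w+) (?=(\w+))/g;
    ++$words{"$1 $2"} while $body =~ m/(\w+) (?=(\w+ \w+))/g;
    return \%words;
}

my $words = count_up_to_three('the quick brown fox jumps over the lazy dog');
print "'$_' => $words->{$_}\n" for sort keys %$words;
```

Only the first word and the following space are consumed; the lookahead leaves the remaining words in place for the next iteration.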

score 2 · Accepted Answer

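Is there a particular reason to use regexes alone? The obvious approach to me would be to split the text into an array, then use a pair of nested loops to extract the counts from it. Something like the following: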

#!/usr/bin/env perl

use strict;
use warnings;

my $text = 'the quick brown fox jumps over the lazy dog';
my $max_words = 3;

my @words = split / /, $text;
my %counts;

for my $pos (0 .. $#words) {
  for my $phrase_len (0 .. ($pos >= $max_words ? $max_words - 1 : $pos)) {
    my $phrase = join ' ', @words[($pos - $phrase_len) .. $pos];
    $counts{$phrase}++;
  }
} 

use Data::Dumper;
print Dumper(\%counts);

Output:

$VAR1 = {
          'over the lazy' => 1,
          'the' => 2,
          'over' => 1,
          'brown fox jumps' => 1,
          'brown fox' => 1,
          'the lazy dog' => 1,
          'jumps over' => 1,
          'the lazy' => 1,
          'the quick brown' => 1,
          'fox jumps' => 1,
          'over the' => 1,
          'brown' => 1,
          'fox jumps over' => 1,
          'quick brown' => 1,
          'jumps' => 1,
          'lazy' => 1,
          'jumps over the' => 1,
          'lazy dog' => 1,
          'dog' => 1,
          'quick brown fox' => 1,
          'fox' => 1,
          'the quick' => 1,
          'quick' => 1
        };

Edit: Per cjm's comment, fixed the $phrase_len loop to prevent the use of negative indices, which caused incorrect results.
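The same idea can be written with windows that run forward from each start position, which avoids the negative-index problem by construction. This is a sketch under assumptions: `count_phrases` is a hypothetical name, and it uses `split ' '` so that runs of whitespace are treated as one separator, which differs from the `split / /` used above.

```perl
use strict;
use warnings;

# Hypothetical helper: counts all phrases of 1 .. $max_words consecutive words.
sub count_phrases {
    my ($text, $max_words) = @_;
    # split ' ' (a single-space string) splits on any whitespace run
    # and ignores leading whitespace.
    my @words = split ' ', $text;
    my %counts;
    for my $start (0 .. $#words) {
        for my $len (1 .. $max_words) {
            my $end = $start + $len - 1;
            last if $end > $#words;    # window would run off the end
            $counts{ join ' ', @words[$start .. $end] }++;
        }
    }
    return \%counts;
}

my $counts = count_phrases('the quick brown fox jumps over the lazy dog', 3);
print "'$_' => $counts->{$_}\n" for sort keys %$counts;
```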

score 1 · Accepted Answer

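Use the pos operator

pos SCALAR

Returns the offset of where the last m//g search left off for the variable in question ($_ is used when the variable is not specified).

and the @- special array

@LAST_MATCH_START

@-

$-[0] is the offset of the start of the last successful match. $-[n] is the offset of the start of the substring matched by the n-th subpattern, or undef if the subpattern did not match.

For example, the program below grabs the second word of each pair in its own capture and rewinds the match's position so that the second word becomes the first word of the next pair: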

#! /usr/bin/perl

use warnings;
use strict;

my $body = 'the quick brown fox jumps over the lazy dog';

my %words;
while ($body =~ /(\w+ (\w+))/g) {
  ++$words{$1};
  pos($body) = $-[2];
}

for (sort { index($body,$a) <=> index($body,$b) } keys %words) {
  print "'$_' => $words{$_}\n";
}

Output:

'the quick' => 1
'quick brown' => 1
'brown fox' => 1
'fox jumps' => 1
'jumps over' => 1
'over the' => 1
'the lazy' => 1
'lazy dog' => 1
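The rewind trick also seems to extend to longer phrases: capture everything after the first word in group 2 and reset pos to where that group starts. The following is a sketch only; `count_by_rewind` and `$rest` are hypothetical names, and it assumes `$n` is at least 2 and that words are separated by single spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: counts $n-word phrases ($n >= 2) by rewinding pos().
sub count_by_rewind {
    my ($body, $n) = @_;
    my $rest = $n - 2;    # words after the second one
    my %words;
    while ($body =~ /(\w+ (\w+(?: \w+){$rest}))/g) {
        ++$words{$1};
        pos($body) = $-[2];    # restart at the second word of this match
    }
    return \%words;
}

my $trigrams = count_by_rewind('the quick brown fox jumps over the lazy dog', 3);
print "'$_' => $trigrams->{$_}\n" for sort keys %$trigrams;
```

Because `$-[2]` is always past the start of the current match, each iteration moves forward by one word and the loop ends once fewer than `$n` words remain.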
