regex - Splitting multi-paragraph documents into paragraph-numbered sentences

Question

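I have a list of well-parsed, multi-paragraph documents (all paragraphs are separated by \n\n and sentences by "."). I would like to split each document into sentences, with each sentence accompanied by the number of the paragraph it belongs to. For example, the (two-paragraph) input is: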

First sentence of the 1st paragraph. Second sentence of the 1st paragraph. \n\n 

First sentence of the 2nd paragraph. Second sentence of the 2nd paragraph. \n\n

Ideally, the output should be:

1 First sentence of the 1st paragraph. 

1 Second sentence of the 1st paragraph. 

2 First sentence of the 2nd paragraph.

2 Second sentence of the 2nd paragraph.

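I'm familiar with the Lingua::Sentence package in Perl, which can split a document into sentences. However, it has no support for paragraph numbering. So I'm wondering whether there is another way to achieve the above (the documents contain no abbreviations). Any help is greatly appreciated. Thanks!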

score 5 · Accepted Answer

If you can rely on the period . as the delimiter, you can do this:

perl -00 -nlwe 'print qq($. $_) for split /(?<=\.)/' yourfile.txt

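Explanation:

-00 sets the input record separator to the empty string, i.e. paragraph mode.
-l chomps each input record and sets the output record separator to the input record separator, which in this case translates to two newlines.

Then we simply split on the period with a lookbehind assertion, which keeps the period attached to its sentence, and print each sentence preceded by $., the current record number. In paragraph mode that is the paragraph number.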
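For anyone who wants this inside a script rather than on the command line, here is a minimal sketch of the same lookbehind split. It is not the one-liner itself: it works on a string, so it splits paragraphs with a regex instead of relying on -00, and it also drops the whitespace that would otherwise be left at the start of each sentence. The name number_sentences is a hypothetical helper, and the sample text is just the input from the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the answer above): returns one
# "N sentence" string per sentence, where N is the 1-based paragraph number.
sub number_sentences {
    my ($text) = @_;
    my @numbered;
    my $para_no = 0;

    # Paragraphs are separated by one or more blank lines.
    for my $paragraph (split /\n\s*\n/, $text) {
        next unless $paragraph =~ /\S/;    # skip whitespace-only chunks
        $para_no++;

        # Split after each period: the lookbehind keeps the period with its
        # sentence, and \s* discards the whitespace that follows it.
        for my $sentence (split /(?<=\.)\s*/, $paragraph) {
            $sentence =~ s/^\s+|\s+$//g;    # trim stray whitespace
            next unless length $sentence;
            push @numbered, "$para_no $sentence";
        }
    }
    return @numbered;
}

# Sample input taken from the question.
my $text = "First sentence of the 1st paragraph. Second sentence of the 1st paragraph. \n\n"
         . "First sentence of the 2nd paragraph. Second sentence of the 2nd paragraph. \n\n";

print "$_\n\n" for number_sentences($text);
```

Like the one-liner, this assumes every period ends a sentence, so it would also split on abbreviations or decimal numbers if the documents contained any.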

score 2 · Answer

Since you mentioned Lingua::Sentence, I think one option is to manipulate the raw output of that module a little to get what you need:

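use strict;
use warnings;
use Lingua::Sentence;

# Assumption: the text is English and is read from STDIN or a file argument.
my $splitter = Lingua::Sentence->new("en");
my $text     = do { local $/; <> };

# split() returns one sentence per line; this relies on the blank lines
# between paragraphs surviving in its output.
my @paragraphs = split /\n{2,}/, $splitter->split($text);

foreach my $index (0..$#paragraphs) {
    # Prefix every sentence with its 1-based paragraph number.
    my $paragraph = join "\n\n", map { ($index + 1) . " $_" }
        split /\n/, $paragraphs[$index];
    print "$paragraph\n\n";
}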
