perl - Converting a paragraph into sentences with Perl

Question

I am working on a Perl program. I need to read a paragraph and print each sentence on its own line.

Does anyone know how to do this?

Here is my code:

#! /C:/Perl64/bin/perl.exe

use utf8;

if (! open(INPUT, '< text1.txt')){
die "cannot open input file: $!";
}

if (! open(OUTPUT, '> output.txt')){
die "cannot open output file: $!";
}

select OUTPUT;

while (<INPUT>){
print "$_";
}

close INPUT;
close OUTPUT;
select STDOUT;

score 6 · Accepted Answer

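I'd let Perl handle the file names rather than handling them myself.

This is rough and ready on a number of levels, and the overall job is undoubtedly difficult.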

sentence.pl

#!/usr/bin/env perl
use strict;
use warnings;
use Lingua::EN::Sentence qw(get_sentences);

sub normalize
{
    my($str) = @_;
    $str =~ s/\n/ /gm;
    $str =~ s/\s\s+/ /gm;
    return $str;
}

{
    local $/ = "\n\n";
    while (<>)
    {
        chomp;
        print "Para: [[$_]]\n";
        my @sentences = split m/(?<=[.!?])\s+/m, $_;
        foreach my $sentence (@sentences)
        {
            $sentence = normalize $sentence;
            print "Ad Hoc Sentence: $sentence\n";
        }
        my $sref = get_sentences($_);
        foreach my $sentence (@$sref)
        {
            $sentence = normalize $sentence;
            print "Lingua Sentence: $sentence\n";
        }
    }
}

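The split regular expression looks for one or more whitespace characters (newlines included) preceded by a full stop (period), exclamation mark, or question mark. The look-behind (?<=[.!?]) means the punctuation stays with the sentence. The normalize function simply flattens newlines into spaces and reduces runs of spaces to a single space. (Note that this does not recognize a parenthetical sentence properly.) The text after such a sentence is treated as part of it, because the . is followed by a closing parenthesis rather than a space.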
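As a rough illustration of the parenthetical case, the look-behind can be given a second alternative that allows one closing bracket or quote after the punctuation. This is only a sketch: split_sentences is a name made up here, the set of closing characters is an assumption, and it still does nothing for abbreviations.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not part of sentence.pl): split on whitespace that
# follows sentence punctuation, optionally followed by one closing ) ] " or '.
# Two separate fixed-width look-behinds are used so the pattern does not
# depend on variable-length look-behind support.
sub split_sentences
{
    my($text) = @_;
    return split m/(?:(?<=[.!?])|(?<=[.!?][)\]"']))\s+/, $text;
}

print "$_\n" for split_sentences('First one. (A parenthetical one.) Third one!');
```

With this pattern the parenthetical sentence comes out as its own piece instead of swallowing the sentence that follows it.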

Sample input

This is a paragraph with more than one sentence in it.  How many will be
determined later.  Mr. A. P. McDowney has been rather busy.  This
incomplete sentence will still be counted as one

This is the second paragraph.  With three sentences in it, it is a lot
less exciting than the first paragraph, but the middle sentence extends
over multiple lines and   there   is     some         wonky spacing too.
But 'tis time to finish.

Sample output

Para: [[This is a paragraph with more than one sentence in it.  How many will be
determined later.  Mr. A. P. McDowney has been rather busy.  This
incomplete sentence will still be counted as one]]
Ad Hoc Sentence: This is a paragraph with more than one sentence in it.
Ad Hoc Sentence: How many will be determined later.
Ad Hoc Sentence: Mr.
Ad Hoc Sentence: A.
Ad Hoc Sentence: P.
Ad Hoc Sentence: McDowney has been rather busy.
Ad Hoc Sentence: This incomplete sentence will still be counted as one
Lingua Sentence: This is a paragraph with more than one sentence in it.
Lingua Sentence: How many will be determined later.
Lingua Sentence: Mr. A. P. McDowney has been rather busy.
Lingua Sentence: This incomplete sentence will still be counted as one
Para: [[This is the second paragraph.  With three sentences in it, it is a lot
less exciting than the first paragraph, but the middle sentence extends
over multiple lines and   there   is     some         wonky spacing too.
But 'tis time to finish.
]]
Ad Hoc Sentence: This is the second paragraph.
Ad Hoc Sentence: With three sentences in it, it is a lot less exciting than the first paragraph, but the middle sentence extends over multiple lines and there is some wonky spacing too.
Ad Hoc Sentence: But 'tis time to finish.
Lingua Sentence: This is the second paragraph.
Lingua Sentence: With three sentences in it, it is a lot less exciting than the first paragraph, but the middle sentence extends over multiple lines and there is some wonky spacing too.
Lingua Sentence: But 'tis time to finish.
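In the output above, the second Para block closes its ]] on a separate line. That happens because the record separator is the literal string "\n\n", so chomp leaves the single trailing newline of the last paragraph alone. A possible alternative, sketched here with a made-up read_paragraphs helper and an in-memory file handle purely for demonstration, is Perl's paragraph mode ($/ set to the empty string), where runs of blank lines count as one separator and chomp removes all trailing newlines.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: return the paragraphs of a string using paragraph mode.
sub read_paragraphs
{
    my($text) = @_;
    # Reading from a scalar reference keeps the example self-contained.
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";
    local $/ = '';          # paragraph mode: blank-line runs separate records
    my @paragraphs;
    while (my $para = <$fh>)
    {
        chomp $para;        # in paragraph mode this strips every trailing newline
        push @paragraphs, $para;
    }
    close $fh;
    return @paragraphs;
}

print "Para: [[$_]]\n" for read_paragraphs("one\ntwo\n\n\n\nthree\n");
```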

Note how Lingua::EN::Sentence manages to handle "Mr. A. P. McDowney" better than the simple-minded regular expression does.
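If installing the module is not an option, the ad hoc split can be nudged in the same direction by gluing pieces back together when they end in a known abbreviation or a single capital initial. What follows is a sketch under loud assumptions: the abbreviation list and the name split_with_abbreviations are invented for this example, and this is not how Lingua::EN::Sentence works internally. It will also wrongly merge a sentence that genuinely ends in an initial.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Assumed, deliberately tiny abbreviation list; extend it for real text.
my %abbrev = map { $_ => 1 } qw(Mr. Mrs. Ms. Dr. Prof.);

# Hypothetical helper: split as in the ad hoc approach, then rejoin any piece
# whose last word is a listed abbreviation or a single capital initial.
sub split_with_abbreviations
{
    my($text) = @_;
    my @sentences;
    my $pending = '';
    foreach my $piece (split m/(?<=[.!?])\s+/, $text)
    {
        $pending = length($pending) ? "$pending $piece" : $piece;
        my($last) = $pending =~ m/(\S+)$/;
        # Keep accumulating while the piece ends in "Mr." or an initial like "A."
        next if defined $last && ($abbrev{$last} || $last =~ m/^[A-Z]\.$/);
        push @sentences, $pending;
        $pending = '';
    }
    push @sentences, $pending if length $pending;
    return @sentences;
}

print "$_\n"
    for split_with_abbreviations('Mr. A. P. McDowney has been rather busy. This is next.');
```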

score 4 · Accepted Answer

Identifying sentences is very hard and language-specific. You will want help. Perhaps Lingua::EN::Sentence is the way to go?

score -1 · Accepted Answer

If the paragraph is given as a string, you can split() it on the characters that mark the end of a sentence.

For example:

my @sentences = split /[.?!]/, $paragraph;
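One thing to be aware of with this form: split consumes whatever the pattern matches, so the terminating punctuation is dropped and each later piece keeps its leading space. A small comparison, using a made-up sample string, against a look-behind pattern that keeps the punctuation:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $paragraph = 'Is it late? Yes. Go!';   # sample text invented for this example

# Splitting on the punctuation itself removes it from the results.
my @plain = split /[.?!]/, $paragraph;

# Splitting on the whitespace after the punctuation keeps it with the sentence.
my @kept = split /(?<=[.?!])\s+/, $paragraph;

print "plain: [$_]\n" for @plain;
print "kept:  [$_]\n" for @kept;
```

Neither form copes with abbreviations such as "Mr.", so this only addresses the lost punctuation.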
