html - Regular expression to parse sentences from HTML?

Question

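I know HTML::Parser exists, and from reading around I realize that trying to parse HTML with regular expressions is generally a suboptimal way of doing things. For a Perl class, however, I am currently trying to use regular expressions (hopefully just a single match) to identify and store the sentences in a saved HTML document. Eventually I want to be able to calculate the number of sentences, the words per sentence, and the average length of the words on the page.

For now I am only trying to isolate whatever comes after a ">" and before a ". ", just to see what gets isolated, but I cannot get the code to run, even after manipulating the regular expression. So I am not sure whether the problem is in the regex, somewhere else, or both. Any help would be appreciated!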

#!/usr/bin/perl
#new
use CGI qw(:standard);
print header;

open FILE, "< sample.html ";
$html = join('', <FILE>);
close FILE;

print "<pre>";

###Main Program###
&sentences;

###sentence identifier sub###

sub sentences {
@sentences;
while ($html =~ />[^<]\. /gis) {
    push @sentences, $1;
}
#for debugging, comment out when running    
    print join("\n",@sentences);
}

print "</pre>";

score 3 · Accepted Answer

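Your regex should be />[^<]*?\./gis

The *? means "match zero or more, non-greedily". As it stands, your regex only matches a single non-< character followed by a period and a space. With this change, it matches all non-< characters up to the first period.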
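To make the difference concrete, here is a minimal sketch comparing the two patterns. The sample string is a made-up assumption rather than the asker's sample.html, and the capture group in the second pattern is added here only so the matched text can be printed.

```perl
use strict;
use warnings;

# Hypothetical sample input, not taken from the question's sample.html
my $html = '<p>First sentence. Second one. </p>';

# Original pattern: exactly one non-"<" character, then a period and a space
my @single = $html =~ />[^<]\. /g;

# Non-greedy version, with a capture group around the sentence text
my @lazy = $html =~ />([^<]*?)\. /g;

print scalar(@single), " match(es) with the original pattern\n";
print "captured: $_\n" for @lazy;
```

On this input the original pattern finds nothing, while the non-greedy one captures only "First sentence", because every match has to begin at a ">".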

There may be other problems as well.

Now read this.

score 2 · Accepted Answer

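A first improvement would be to write $html =~ />([^<.]+)\. /gs. You need to capture the match with parentheses, and allow more than one letter per sentence ;--)

However, this does not get all of the sentences, only the first one in each element.

A better approach is to capture all of the text first, then extract the sentences from each fragment: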

while( $html=~ m{>([^<]*)<}g) { push @text_content, $1}; 
foreach (@text_content) { while( m{([^.]*)\.}gs) { push @sentences, $1; } }

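(Untested, as it is early morning and coffee is calling.)

All the usual caveats about parsing HTML with regular expressions apply, most notably the possibility of a ">" appearing in the text.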
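The two-step idea can be sketched as a small standalone script. The sample markup is a hypothetical string chosen for illustration, and the whitespace filter and the leading \s* are additions that are not part of the answer's one-liners.

```perl
use strict;
use warnings;

# Hypothetical sample input for illustration only
my $html = '<p>One here. Two here.</p><p>Three <b>bold</b> here.</p>';

# Step 1: collect the text found between a ">" and the next "<"
my @text_content;
while ($html =~ m{>([^<]*)<}g) {
    my $text = $1;                               # copy before running another match
    push @text_content, $text if $text =~ /\S/;  # skip empty or whitespace-only runs
}

# Step 2: split each text run into period-terminated sentences
my @sentences;
for my $chunk (@text_content) {
    while ($chunk =~ m{\s*([^.]+)\.}g) {
        push @sentences, $1;
    }
}

print "$_\n" for @sentences;
```

On this input the script prints "One here", "Two here" and "here". The third sentence is only partly recovered because the inline b tag splits it into separate text runs, which is one more reason the usual caveats apply.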

score 0 · Accepted Answer

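I think this does more or less what you need. Keep in mind that this script only looks at the text inside p tags. The file name is passed in as a command-line argument (shift).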

#!/usr/bin/perl

 use strict;
 use warnings;
 use HTML::Grabber;

 my $file_location = shift;
 print "\n\nfile: $file_location";
 my $totalWordCount = 0;
 my $sentenceCount = 0;
 my $wordsInSentenceCount = 0;
 my $averageWordsPerSentence = 0;
 my $char_count = 0;
 my $contents;
 my $rounded;
 my $rounded2;

 open ( my $file, '<', $file_location  ) or die "cannot open < file: $!";

    while( my $line = <$file>){
          $contents .= $line;
  }      
 close( $file );
 my $dom = HTML::Grabber->new( html => $contents );

 $dom->find('p')->each( sub{
    my $p_tag = $_->text;

    ++$totalWordCount while $p_tag =~ /\S+/g;


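    # Count sentences first; changing $p_tag inside a //g loop would reset the match position
    ++$sentenceCount while $p_tag =~ /[.!?]+/g;

    # Count non-whitespace characters once per paragraph, not once per sentence
    (my $stripped = $p_tag) =~ s/\s//g;
    $char_count += length($stripped);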
     });     


           print "\n Total Words: $totalWordCount\n";
           print " Total Sentences: $sentenceCount\n";
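           # Guard against division by zero when no sentence terminator was found
           $rounded = $sentenceCount ? sprintf("%.2f", $totalWordCount / $sentenceCount) : 0;
           print  " Average words per sentence: $rounded.\n\n";
           print " Total Characters: $char_count.\n";
           # Guard against division by zero when the page has no words
           my $averageCharsPerWord = $totalWordCount ? $char_count / $totalWordCount : 0;

           $rounded2 = sprintf("%.2f", $averageCharsPerWord );

           print  " Average characters per word: $rounded2.\n\n";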
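For trying the counting logic without installing HTML::Grabber, the same statistics can be sketched in core Perl on text that has already been extracted. The text_stats helper and its hash keys are hypothetical names introduced here. As in the script above, punctuation counts toward word length, and every run of ".", "!" or "?" is treated as one sentence end.

```perl
use strict;
use warnings;

# Hypothetical helper: takes plain text (already extracted from the HTML)
# and returns sentence count, word count, and two averages.
sub text_stats {
    my ($text) = @_;

    my @words      = $text =~ /\S+/g;          # whitespace-separated tokens
    my $word_count = @words;
    my $sentences  = () = $text =~ /[.!?]+/g;  # runs of sentence terminators
    my $chars      = 0;
    $chars += length($_) for @words;           # non-whitespace characters only

    return {
        sentences          => $sentences,
        words              => $word_count,
        words_per_sentence => $sentences  ? $word_count / $sentences : 0,
        chars_per_word     => $word_count ? $chars / $word_count     : 0,
    };
}

my $stats = text_stats('One two three. Four five!');
printf "%d sentences, %d words, %.2f words/sentence, %.2f chars/word\n",
    @{$stats}{qw(sentences words words_per_sentence chars_per_word)};
```

Returning 0 for an average when the divisor is zero is a design choice made for this sketch; a real script might prefer to report that no sentences were found.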
