1

我正在尝试使用 perl 解析出多行字符串,但我只得到匹配的数量。这是我正在解析的示例:

<div id="content-ZAJ9E" class="content">
        Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look.
</div>

我正在尝试使用以下代码将内容存储在字符串中:

@a = ($html =~ m/class="content">.*<\/div>/gs);
print "array A, size: ",  @a+0,  ", elements: ";
print join (" ", @a);
print "\n";

但它返回的不仅仅是div中的文本。有人可以指出我的正则表达式中的错误吗?

玛丽莎

4

3 回答 3

7

使用强大的 HTML 解析器:

use strictures;
use Web::Query qw();
my $w = Web::Query->new_from_html(<<'HTML');
<div id="content-ZAJ9E" class="content">
        Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look.
</div>
HTML
$w->find('div.content')->text

表达式返回Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look.

于 2012-06-21T14:06:52.727 回答
5

使用旨在解析 HTML 的东西,例如HTML::TreeBuilder::XPath

#!/usr/bin/env perl

use strict; use warnings;
use 5.014;
use HTML::TreeBuilder::XPath;
use YAML;

my $doc =<<EO_HTML;
<div id="content-ZAJ9E" class="content">
<!-- begin <div> -->
        Wow, I love the new top bar, so much easier to navigate now :)
        Anywho, got a few other fixes I am working on as well. :)
        I hope you all like the new look.
<!-- end </div> -->
<span class="extra">Here I am</span>
</div>
EO_HTML

use HTML::TreeBuilder::XPath;
my $tree= HTML::TreeBuilder::XPath->new;
$tree->store_comments(1);
$tree->parse($doc);

print Dump [ $tree->findvalues('//div[@class="content"]') ];
print Dump [ $tree->findvalues('//*[@class="extra"]') ];
print Dump [ $tree->findvalues('//comment()') ];

请注意不依赖自制正则表达式模式来处理输入的各种变化所提供的能力。

输出:

---
- '  Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look. Here I am '
---
- Here I am
---
- ' begin <div> '
- ' end </div> '
于 2012-06-21T14:13:46.477 回答
4

你只是匹配字符串,你没有解析出任何东西。如果你想在中间的文本div,你应该说:

$html =~ m/class="content">(.*)<\/div>/gs;
my $text = $1;
print $text;

您的匹配项将存储在$1变量中。如果有多个这样的实例,则div[class=content]需要这样的循环:

use strict; use warnings;
use Data::Dumper;

my $html = qq~<div id="content-ZAJ9E" class="content">
        Wow, I love the new top bar.
</div>
<div id="content-ZAJ9E" class="content">
        I still love it.
</div>
<div id="content-ZAJ9E" class="content">
        I cant get enough!
</div>
~;

my @matches;
# *? makes it non-greedy so it will only match to the first </div>
while ($html =~ m/class="content">(.*?)<\/div>/gs){ 
  my $group = $1;     
  $group =~ s/^\s+//; # strip whitespace at the beginning
  $group =~ s/\s+$//; # and the end

  push @matches, $group;
}
print Dumper \@matches;

我建议你看看perlreand perlretut


一些注意事项:

于 2012-06-21T13:32:00.730 回答