html - Perl regular expression to extract a value from nested HTML tags

Question

$match = q(<a href="#google"><h1><b>Google</b></h1></a>);
if($match =~ /<a.*?href.*?><.?>(.*?)<\/a>/){
$title = $1;
}else {
$title="";
}
print"$title";

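Output: Google</b></h1>

It should be: Google

I can't extract the link text with a regular expression in Perl. The text may be wrapped in more or fewer nested tags, for example: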

<h1><b><i>Google</i></b></h1>

Please test against these inputs:

1) <td><a href="/wiki/Unix_shell" title="Unix shell">Unix shell</a>

2) <a href="http://www.hp.com"><h1><b>HP</b></h1></a>

3) <a href="/wiki/Generic_programming" title="Generic programming">generic</a></td>);

4) <a href="#cite_note-1"><span>[</span>1<span>]</span></a>

Output:

Unix shell

HP

generic

[1]

score 5 · Accepted Answer

As mentioned in the comments, don't use regular expressions for this. I especially like the Mojo suite, which lets me use CSS selectors:

use Mojo::DOM;

my $dom = Mojo::DOM->new(q(<a href="#google"><h1><b>Google</b></h1></a>));

print $dom->at('a[href="#google"]')->all_text, "\n";

Or with HTML::TreeBuilder::XPath:

use HTML::TreeBuilder::XPath;

my $dom = HTML::TreeBuilder::XPath->new_from_content(q(<a href="#google"><h1><b>Google</b></h1></a>));

print $dom->findvalue('//a[@href="#google"]'), "\n";

score 2 · Accepted Answer

Try this:

if($match =~ /<a.*?href.*?><b>(.*?)<\/b>/)

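This captures whatever sits between <b> and </b> after the opening tag that contains href.

To instead get "everything after the last > and before the first </", you can use: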

<a.*?href.*?>([^>]*?)<\/
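A minimal sketch of how that second pattern behaves on the four sample inputs from the question, assuming HP and generic are the link texts of samples 2 and 3. The helper name extract_first_text is hypothetical and only wraps the pattern so it can be called per input.

```perl
use strict;
use warnings;

# Sample inputs copied from the question.
my @samples = (
    q{<td><a href="/wiki/Unix_shell" title="Unix shell">Unix shell</a>},
    q{<a href="http://www.hp.com"><h1><b>HP</b></h1></a>},
    q{<a href="/wiki/Generic_programming" title="Generic programming">generic</a></td>);},
    q{<a href="#cite_note-1"><span>[</span>1<span>]</span></a>},
);

# Hypothetical helper: returns the first run of tag-free text that is
# immediately followed by a closing tag, or '' when nothing matches.
sub extract_first_text {
    my ($html) = @_;
    return $html =~ m{<a.*?href.*?>([^>]*?)</} ? $1 : '';
}

print extract_first_text($_), "\n" for @samples;
# Prints:
#   Unix shell
#   HP
#   generic
#   [
```

The pattern handles the first three inputs, but for the fourth it stops at the first closing tag and returns only "[", because the link text there is split across several child elements.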

score 0 · Accepted Answer

~~For this simple case, you can use:~~ The requirements are no longer simple; see @amon's answer for how to use an HTML parser.

/<a.*?>([^<]+)</

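This matches an opening a tag, then skips ahead until it finds text between a > and a <.

Although, as others have mentioned, you should generally use an HTML parser.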

echo '<td><a href="/wiki/Unix_shell" title="Unix shell">Unix shell</a>
<a href="http://www.hp.com"><h1><b>HP</b></h1></a>
<a href="/wiki/Generic_programming" title="Generic programming">generic</a></td>);' | perl -ne '/<a.*?>([^<]+)</; print "$1\n"'
Unix shell
HP
generic
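For link text that is split across several child elements, such as the [1] sample in the question, one possible regex-only workaround is a two-step approach: capture everything up to the closing </a>, then delete the tags inside the capture. This is a different technique from the one-liner above, and the helper name anchor_text is hypothetical. It is still pattern matching rather than parsing, so it assumes no nested anchors, no HTML comments, and no > inside attribute values.

```perl
use strict;
use warnings;

# Hypothetical helper: captures the content of the first <a ...>...</a> pair,
# then removes any tags found inside that content.
sub anchor_text {
    my ($html) = @_;
    # \b keeps <a from matching longer tag names such as <abbr>.
    return '' unless $html =~ m{<a\b[^>]*>(.*?)</a>}s;
    my $inner = $1;
    $inner =~ s/<[^>]+>//g;    # drop nested tags such as <h1>, <b>, <span>
    return $inner;
}

print anchor_text(q{<a href="#google"><h1><b>Google</b></h1></a>}), "\n";
# Prints: Google
print anchor_text(q{<a href="#cite_note-1"><span>[</span>1<span>]</span></a>}), "\n";
# Prints: [1]
```

An HTML parser remains the more robust choice for anything beyond input you control.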

score 0 · Accepted Answer

I came up with this regex, which works under PCRE for all the sample inputs. It is equivalent to a regular grammar, using the tail-recursive pattern (?1)*:

(?<=>)((?:\w+)(?:\s*))(?1)*

Take only the first element of the returned array, i.e. array[0].
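The same pattern can be tried directly in Perl, which supports the (?1) recursion syntax from version 5.10 onward. In this sketch the whole match ($&) stands in for array[0], and the helper name first_word_run is hypothetical.

```perl
use strict;
use warnings;

# The pattern from the answer, unchanged. (?1) re-runs capture group 1.
my $re = qr/(?<=>)((?:\w+)(?:\s*))(?1)*/;

# Hypothetical helper: returns the whole match, i.e. the first run of
# word characters and whitespace that directly follows a '>'.
sub first_word_run {
    my ($html) = @_;
    return $html =~ $re ? $& : '';
}

print first_word_run(q{<td><a href="/wiki/Unix_shell" title="Unix shell">Unix shell</a>}), "\n";
# Prints: Unix shell
print first_word_run(q{<a href="http://www.hp.com"><h1><b>HP</b></h1></a>}), "\n";
# Prints: HP
print first_word_run(q{<a href="#cite_note-1"><span>[</span>1<span>]</span></a>}), "\n";
# Prints: 1
```

Because \w does not match brackets, the footnote-style input yields 1 rather than [1].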
