regex - Using Perl to strip everything from a string except an HTML anchor link

Question

Using Perl, how can I use a regex to take a string that contains arbitrary HTML, including one HTML link with anchor text, like this:

  <a href="http://example.com" target="_blank">Whatever Example</a>

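and have it leave ONLY that and get rid of everything else? It should work no matter what else is inside the <a tag alongside the href attribute, such as title=, style=, or anything else. It should keep the anchor text ("Whatever Example") and the closing </a>.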

score 2 · Accepted Answer

You can use a stream parser such as HTML::TokeParser::Simple:

#!/usr/bin/env perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

my $html = <<EO_HTML;

Using Perl, how can I use a regex to take a string that has random HTML in it
with one HTML link with anchor, like this:

   <a href="http://example.com" target="_blank">Whatever <i>Interesting</i> Example</a>

       and it leave ONLY that and get rid of everything else? No matter what
   was inside the href attribute with the <a, like title=, or style=, or
   whatever. and it leave the anchor: "Whatever Example" and the </a>?
EO_HTML

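# Parse the HTML from a string rather than from a file.
my $parser = HTML::TokeParser::Simple->new(string => $html);

# get_tag('a') skips ahead to each opening <a> tag; get_text('/a') returns
# the plain text up to the closing </a>, dropping nested tags such as <i>.
while (my $tag = $parser->get_tag('a')) {
    print $tag->as_is, $parser->get_text('/a'), "</a>\n";
}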

Output:

$ ./whatever.pl
<a href="http://example.com" target="_blank">Whatever Interesting Example</a>
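
Note that the printed link contains only plain text, because get_text discards nested tags such as <i>. If the inner markup should be kept, one possible variation is to walk the token stream and concatenate each token's original text between the opening and closing anchor tags. This is only a sketch: the sample string and the @anchors and $current variables are illustrative assumptions, not part of the answer above.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

# Hypothetical sample input for illustration.
my $html = 'Intro <b>text</b> '
         . '<a href="http://example.com" target="_blank">Whatever <i>Interesting</i> Example</a>'
         . ' trailing text';

my $parser = HTML::TokeParser::Simple->new(string => $html);

my @anchors;   # completed links, inner markup included
my $current;   # defined only while we are inside an <a> ... </a> pair

while (my $token = $parser->get_token) {
    if ($token->is_start_tag('a')) {
        # Start collecting at the opening tag, attributes included.
        $current = $token->as_is;
    }
    elsif (defined $current) {
        # Inside a link: keep every token exactly as it appeared.
        $current .= $token->as_is;
        if ($token->is_end_tag('a')) {
            push @anchors, $current;
            undef $current;
        }
    }
}

print "$_\n" for @anchors;
```

Everything outside the anchor is ignored, and an anchor that is never closed is silently dropped in this sketch.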

score 1 · Accepted Answer

If you need a simple regex solution, a naive approach might be:

my @anchors = $text =~ m@(<a[^>]*?>.*?</a>)@gsi;
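
As a rough illustration of how that one-liner behaves, here is a minimal runnable sketch. The sample text is a hypothetical input made up for this example; only the regex itself comes from the answer.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical sample input; note the opening tag spans two lines.
my $text = <<'EO_HTML';
<p>Some <b>random</b> markup</p>
<a href="http://example.com" target="_blank"
   title="demo">Whatever Example</a>
<div>more text</div>
EO_HTML

# /g collects every match, /s lets . cross newlines, /i ignores tag case.
# In list context the match returns the captured groups, one per link.
my @anchors = $text =~ m@(<a[^>]*?>.*?</a>)@gsi;

print "$_\n" for @anchors;
```

Using @ as the pattern delimiter avoids having to escape the slash in </a>.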

However, as @dan1111 mentioned, a regex is not the right tool for parsing HTML, for a variety of reasons.
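
One such reason can be shown with a small sketch: the pattern <a[^>]*?> also matches any other tag whose name begins with "a", such as <abbr>. The input string and the stricter variant using \b are assumptions added here for illustration; the stricter pattern fixes this one case only and is still fooled by things like commented-out markup.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical input containing an <abbr> element before the real link.
my $text = '<abbr title="HyperText Markup Language">HTML</abbr> and '
         . '<a href="http://example.com">a link</a>';

# The naive pattern starts matching at <abbr ...> and runs to the first </a>.
my @naive = $text =~ m@(<a[^>]*?>.*?</a>)@gsi;

# Assumed refinement: \b requires the tag name to end right after the "a".
my @stricter = $text =~ m@(<a\b[^>]*>.*?</a>)@gsi;

print "naive:    $_\n" for @naive;
print "stricter: $_\n" for @stricter;
```

Here the naive pattern captures the whole string starting at <abbr, while the stricter one captures only the link.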

If you need a robust solution, look for an HTML parser module.
