regex - How do I extract hrefs from an email body in Perl?

Question

I'm trying to extract URLs (there may be more than one) from the body of an email.

I'm trying to parse the URLs like this:

use strict;
use warnings;
use Net::IMAP::Simple;
use Email::Simple;
use IO::Socket::SSL;

# IMAP connection setup ($imap, $i) omitted to save space

my $es = Email::Simple->new( join '', @{ $imap->get($i) } );
my $text = $es->body;
print $text;
my $matches = ($text =~/<a[^>]*href="([^"]*)"[^>]*>.*<\/a>/);
print $matches;

$text contains the following:

 --047d7b47229eb3d9f404e58fd90a
    Content-Type: text/plain; charset=ISO-8859-1

    Try1 <http://www.washingtonpost.com/>

    Try2 <http://www.thesun.co.uk/sol/homepage/>

    --047d7b47229eb3d9f404e58fd90a
    Content-Type: text/html; charset=ISO-8859-1

    <div dir="ltr"><a href="http://www.washingtonpost.com/">Try1</a><br><div><br></div><div><a href="http://www.thesun.co.uk/sol/homepage/">Try2</a><br></div></div>

    --047d7b47229eb3d9f404e58fd90a--

The program's output is just 1, and nothing else.

Can anyone help?

Thanks in advance.

score 6 · Accepted Answer

Email::Simple is not suited to MIME messages; use Courriel instead. Regular expressions are not suited to parsing HTML; use Web::Query instead.

use Courriel qw();
use Web::Query qw();

my $email = Courriel->parse( text => join …);
my $html = $email->html_body_part;
my @url = Web::Query->new_from_html($html)->find('a[href]')->attr('href');
__END__
http://www.washingtonpost.com/
http://www.thesun.co.uk/sol/homepage/

score 2 · Accepted Answer

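The advice to use a different email-handling module and not to parse HTML with regular expressions is all good, and you should definitely pay attention to it.

But nobody has yet explained why your code gives you the wrong result.

It's because you are calling the match operator in scalar context. In scalar context it returns a boolean value indicating whether the match succeeded. Hence the 1 (true) that you get.

To get the captures from a regex match, you need to call the match operator in list context. That can be as simple as: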

my ($matches) = ($text =~/<a[^>]*href="([^"]*)"[^>]*>.*<\/a>/);

But you might consider using an array, in case you want to add /g to the match operator and get multiple matches.

my @matches = ($text =~/<a[^>]*href="([^"]*)"[^>]*>.*<\/a>/g);
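The difference between the two contexts can be checked with a small self-contained script. This is only a sketch: the sample string is the HTML line from the question, and the variable names ($matched, @greedy, @urls) are made up for illustration. One caveat it also shows: when several links sit on a single line, the greedy .* in the pattern runs to the last </a> on that line, so /g finds only the first href. The sketch assumes a non-greedy .*? is acceptable for this input.

```perl
use strict;
use warnings;

# Sample input: the HTML line from the question's message body.
my $text = '<div dir="ltr"><a href="http://www.washingtonpost.com/">Try1</a><br>'
         . '<div><br></div><div><a href="http://www.thesun.co.uk/sol/homepage/">Try2</a>'
         . '<br></div></div>';

# Scalar context: the result is only a success flag, not the capture.
my $matched = ( $text =~ /<a[^>]*href="([^"]*)"[^>]*>.*<\/a>/ );

# List context with /g and the greedy .* : the first match swallows
# everything up to the last </a>, so only one href is captured.
my @greedy = ( $text =~ /<a[^>]*href="([^"]*)"[^>]*>.*<\/a>/g );

# List context with /g and a non-greedy .*? : each link matches separately.
my @urls = ( $text =~ /<a[^>]*href="([^"]*)"[^>]*>.*?<\/a>/g );

print "scalar context: $matched\n";
print "greedy matches: ", scalar(@greedy), "\n";
print "$_\n" for @urls;
```

This is still a regex working on HTML, so it remains fragile (single-quoted attributes, newlines inside a link, and so on); the parser-based approach from the other answer is the more robust route.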
