I use a regex to extract <img src="img.jpg"> tags.
Here is my regex:
my @accept = $message_body =~ /<img src=\"\S*\">/gi;
Now my regex fails when the img tag is like this: <img src="cid:img.jpg">
Can anyone tell me why?
The greediness of \"\S*\" means that it will match as many non-space characters as possible, so the match ends at the last " in that run of characters rather than at the first one. You could change this to \".*?\", which is lazy and will match any characters only up to the next ".
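A minimal sketch of the difference between a greedy and a lazy quantifier. The input string is made up for illustration, and the patterns use . instead of \S so that the greedy version can run across the space between the two tags:

```perl
use strict;
use warnings;

# Hypothetical input: two img tags on one line, separated by a space.
my $html = '<img src="a.jpg"> <img src="cid:b.jpg">';

# Greedy: .* runs to the end of the string, then backs up to the LAST ">
my @greedy = $html =~ /<img src="(.*)">/gi;

# Lazy: .*? grows one character at a time and stops at the FIRST ">
my @lazy = $html =~ /<img src="(.*?)">/gi;

print scalar(@greedy), " greedy capture(s): @greedy\n";
print scalar(@lazy),   " lazy capture(s): @lazy\n";
```

The greedy pattern returns a single capture that swallows both tags, while the lazy one returns each src value separately.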
I would completely overhaul your expression so that it also copes with some other difficult HTML edge cases.
This expression will not be fooled by an attribute value that contains a > or something that looks like an attribute inside an embedded JavaScript function, and it will not mistake an attribute whose name merely ends in src, like hrefsrc="somevalue", for the real one. The (?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=['"]([^"]*)['"]) construct allows multiple attributes to appear in any order inside the img tag.
<img\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=['"]([^"]*)['"])(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*\s?>
Live Example: http://www.rubular.com/r/bRmdy0YA0S
Sample Text
Note how the second image tag has some of the really difficult edge cases.
<img src="cid:img.jpg">
<img hrefsrc="NotMe.jpg" onmouseover=' src="NotTheMeEither.jpg" ; if ( 6 > x ) { funRotator(src) ; } ; ' src="cid:DifficultToFind.jpg">
Matches
[0][0] = <img src="cid:img.jpg">
[0][1] = cid:img.jpg
[1][0] = <img hrefsrc="NotMe.jpg" onmouseover=' src="NotTheMeEither.jpg" ; if ( 6 > x ) { funRotator(src) ; } ; ' src="cid:DifficultToFind.jpg">
[1][1] = cid:DifficultToFind.jpg
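The live example runs on Rubular, which uses Ruby's regex engine. As a sketch of how the same expression might be used from Perl, the snippet below applies it to the sample text above. The variable names ($html, $img_re, @srcs) are just illustrative, and the expression is copied unchanged:

```perl
use strict;
use warnings;

# The sample text from above, kept verbatim in a non-interpolating heredoc.
my $html = <<'END_HTML';
<img src="cid:img.jpg">
<img hrefsrc="NotMe.jpg" onmouseover=' src="NotTheMeEither.jpg" ; if ( 6 > x ) { funRotator(src) ; } ; ' src="cid:DifficultToFind.jpg">
END_HTML

# qr{} avoids having to escape the quotes inside the pattern.
my $img_re = qr{<img\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=['"]([^"]*)['"])(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*\s?>}i;

# In list context with /g, the match returns the contents of capture group 1
# for every img tag found.
my @srcs = $html =~ /$img_re/g;

print "$_\n" for @srcs;
```

With one capture group and the /g flag, list context yields only the captured src values, which correspond to the [n][1] entries in the match list above.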
The * quantifier is greedy: it matches as much as it can while allowing the rest of the pattern to match. In your case, \S* is likely consuming more text than you intended.
Consider using
my @accept = $message_body =~ /<img src="\S*?">/gi;
or
my @accept = $message_body =~ /<img src="[^"]+">/gi;
These patterns attempt to stop matching as soon as they detect a closing double-quote, but they are heuristics that could fail depending on how friendly your input is. To do the job properly, use an HTML parser.
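To illustrate how such a heuristic can fail on less friendly input, here is a small sketch that runs the second pattern (with a capture group added) against a few hand-written tags. The tags are invented for this example:

```perl
use strict;
use warnings;

# Hypothetical test tags, all of which a browser would accept.
my @tags = (
    '<img src="cid:img.jpg">',               # the simple case
    '<img alt="logo" src="cid:img.jpg">',    # another attribute comes first
    "<img src='cid:img.jpg'>",               # single-quoted value
    '<img src="cid:img.jpg" width="10">',    # another attribute follows src
);

my @found;
for my $tag (@tags) {
    # Same heuristic as above, with parentheses to capture the src value.
    my ($src) = $tag =~ /<img src="([^"]+)">/i;
    push @found, defined($src) ? 1 : 0;
    printf "%-40s %s\n", $tag, defined($src) ? "matched $src" : 'no match';
}
```

Only the first tag matches, because the pattern hard-codes the attribute order, the quote style, and the position of the closing >.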
In case you missed n0rd's comment, here is the essential link again about the use of regular expressions with (X|HT)ML.
With that out of the way, here is one way to do it with a module (of course, in the spirit of TIMTOWTDI, more than one module would be suitable):
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $file = shift or die "Missing argument! Usage: $0 FILENAME\n";

# parse_file opens the file itself, so no separate open() is needed.
my $t = HTML::TreeBuilder::XPath->new();
$t->parse_file($file)
    or die "Could not parse $file\n";

# Print each img element, followed by whichever of these attributes it has.
foreach my $img ( $t->findnodes('//img') ) {
    print $img->as_HTML, "\n";
    foreach my $attr (qw(src width height alt title)) {
        my $value = $img->attr($attr);
        print "$attr = $value\n" if defined $value;
    }
    print "\n";
}

$t->delete;    # release the parse tree