regex - Regex to capture tags fails when "src" value is different

Question

I use a regex to extract <img src="img.jpg"> tags

Here is my regex

my @accept = $message_body =~ /<img src=\"\S*\">/gi;

Now my regex fails when the img tag is like this: <img src="cid:img.jpg">

Can anyone tell me why?

score 4 · Accepted Answer

Description

The greediness of \"\S*\" means it matches as many non-space characters as possible, stopping only at the last " it can reach in that run of non-space text. You could change this to \".*?\", which matches any characters up to the next ".
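To make the difference concrete, here is a minimal sketch comparing the greedy and lazy forms. The input string is a made-up example (not taken from the question) in which more quoted text follows the tag without any whitespace, which is the kind of input where \S* can run past the intended closing quote.

```perl
use strict;
use warnings;

# Hypothetical input: quoted text follows the tag with no whitespace in between.
my $html = '<img src="cid:img.jpg">"caption">';

# With /g in list context and no capture groups, the match returns the
# matched substrings; we keep only the first one of each.
my ($greedy) = $html =~ /<img src="\S*">/gi;
my ($lazy)   = $html =~ /<img src=".*?">/gi;

print "greedy: $greedy\n";   # runs on to the last "> in the non-space run
print "lazy:   $lazy\n";     # stops at the first ">
```

On this input the greedy pattern swallows the trailing "caption"> as well, while the lazy one stops at the end of the tag.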

I would completely overhaul your expression so that it would avoid some other difficult HTML edge cases.

This expression will:

match img tags which have an src attribute
capture the src attribute value
avoid messy HTML edge cases, such as:
- a > or something that looks like an attribute inside an embedded JavaScript function
- attributes whose names end with src, like hrefsrc="somevalue"
Although not used for this problem because you're only looking for a single attribute, the (?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=['"]([^"]*)['"]) construct allows multiple attributes to appear in any order inside the img tag.

<img\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=['"]([^"]*)['"])(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*\s?>


Example

Live Example: http://www.rubular.com/r/bRmdy0YA0S

Sample Text

Note how the second image tag has some of the really difficult edge cases.

<img src="cid:img.jpg">
<img hrefsrc="NotMe.jpg" onmouseover=' src="NotTheMeEither.jpg" ; if ( 6 > x ) { funRotator(src) ; } ; ' src="cid:DifficultToFind.jpg">

Matches

[0][0] = <img src="cid:img.jpg">
[0][1] = cid:img.jpg

[1][0] = <img hrefsrc="NotMe.jpg" onmouseover=' src="NotTheMeEither.jpg" ; if ( 6 > x ) { funRotator(src) ; } ; ' src="cid:DifficultToFind.jpg">
[1][1] = cid:DifficultToFind.jpg
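The live example runs on Rubular, so as a rough sketch, this is how the same expression might be dropped into the Perl code from the question. The variable names $img_re and @sources are illustrative, and $message_body is filled with the sample text above instead of a real message.

```perl
use strict;
use warnings;

# The expression from this answer, compiled once; /i makes the tag name case-insensitive.
my $img_re = qr{<img\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=['"]([^"]*)['"])(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*\s?>}i;

# Sample text from above, standing in for the real message body.
my $message_body = <<'HTML';
<img src="cid:img.jpg">
<img hrefsrc="NotMe.jpg" onmouseover=' src="NotTheMeEither.jpg" ; if ( 6 > x ) { funRotator(src) ; } ; ' src="cid:DifficultToFind.jpg">
HTML

# With one capture group and /g in list context, the match returns the
# captured src values rather than the whole tags.
my @sources = $message_body =~ /$img_re/g;

print "$_\n" for @sources;
```

To collect the whole tags instead, as the original @accept did, the expression could be wrapped in an outer capture group, at the cost of getting both the tag and the src value back for each match.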

score 3 · Accepted Answer

The * quantifier is greedy: it matches as much as it can while allowing the rest of the pattern to match. In your case, \S* is likely consuming more text than you intended.

Consider using

my @accept = $message_body =~ /<img src="\S*?">/gi;

or

my @accept = $message_body =~ /<img src="[^"]+">/gi;

These patterns attempt to stop matching as soon as they detect a closing double-quote, but they are heuristics that could fail depending on how friendly your input is. To do the job properly, use an HTML parser.
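As a small sketch of the second pattern in use, the snippet below runs it over a made-up message body containing both kinds of src value from the question. The capture-group variant and the @srcs name are additions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical message body with a plain and a cid: image reference.
my $message_body = '<p><img src="img.jpg"><img src="cid:img.jpg"></p>';

# [^"]+ cannot cross a double quote, so each match ends at the first closing quote.
my @accept = $message_body =~ /<img src="[^"]+">/gi;

# Adding a capture group returns just the attribute values instead of the tags.
my @srcs = $message_body =~ /<img src="([^"]+)">/gi;

print "$_\n" for @accept, @srcs;
```

Both tags are found even though they sit next to each other with no whitespace in between. The pattern still assumes src is the first and only attribute, which is one reason a parser is the safer choice.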

score 0 · Accepted Answer

In case you missed n0rd's comment, here is the essential link again about the use of regular expressions with (X|HT)ML.

With that out of the way, here is one way to do it with a module. (Of course, in the spirit of TIMTOWTDI, more than one module would be suitable.)

#!/usr/bin/perl

use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $file = shift or die "Missing argument! Usage: $0 FILENAME\n";

# parse_file() opens the file itself, so no separate open() is needed.
my $t = HTML::TreeBuilder::XPath->new();

$t->parse_file($file)
    or die "Could not parse $file\n";

# Visit every <img> element in the document.
foreach my $img ( $t->findnodes( '//img' ) ) {
    print $img->as_HTML, "\n";
    foreach my $attr ( qw(src width height alt title) ) {
        print "$attr = ", $img->attr($attr), "\n" if defined($img->attr($attr));
    }
    print "\n";
}
