0

I need a RegEx pattern for extracting all the properties of an image tag.

As we all know, there is a lot of malformed HTML out there, so the pattern has to cover those possibilities.

I was looking at this solution https://stackoverflow.com/questions/138313/how-to-extract-img-src-title-and-alt-from-html-using-php but it didn't quite cover everything.

I came up with something like:

(alt|title|src|height|width)\s*=\s*["'][\W\w]+?["']

Are there any cases I'm missing, or is there a simpler, more efficient pattern?

EDIT:
Sorry, I'll be more specific: I'm doing this in .NET, so it's on the server side.
I already have a list of img tags; now I just need to parse the properties.

4

6 Answers

5

As we all know, there is a lot of malformed HTML out there, so the pattern has to cover those possibilities.

It won't. Use an HTML parser if you have to parse "evil" HTML (that is, HTML from an unknown source).

answered 2008-12-08T17:35:28.343
1

If performance is not a big concern, I'd go with an HTML parser (like BeautifulSoup in Python) if you are doing this server-side, or jQuery or just plain JavaScript if you are doing it client-side. Granted, it is overkill, but it is a lot quicker to write, less likely to have bugs (since the parser authors have already thought of the corner cases), and it will handle the potential malformedness.
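As a rough illustration of the parser route, here is a minimal sketch that swaps BeautifulSoup for `html.parser` from the Python standard library, so nothing needs to be installed. The class name `ImgAttributeCollector`, the helper `img_attributes`, and the sample markup are made up for this example; a .NET project would use a .NET parser instead.

```python
from html.parser import HTMLParser


class ImgAttributeCollector(HTMLParser):
    """Collects the attributes of every <img> tag as a dict (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names and passes attrs as
        # (name, value) pairs; an attribute without a value gets None.
        if tag == "img":
            self.images.append(dict(attrs))


def img_attributes(html):
    """Return one dict of attributes per <img> tag found in the markup."""
    parser = ImgAttributeCollector()
    parser.feed(html)
    parser.close()
    return parser.images


# Sample input mixing quote styles, an unquoted value and a valueless attribute.
sample = "<p><IMG SRC='a.png' alt=\"it's a cat\" width=100 ismap><img src=\"b.png\" /></p>"
for attrs in img_attributes(sample):
    print(attrs)
```

The parser copes with single quotes, double quotes, unquoted values and an apostrophe inside a double-quoted value without any extra work, which is the main argument for using one here.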

answered 2008-12-08T17:36:38.993
1

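Your best bet is to use something like the HTML Agility Pack rather than a regex. It is designed to handle a wide range of cases, and it will save you a lot of trouble hammering out the edge cases.

answered 2010-01-03T06:52:29.480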
0

Before committing yourself to regex, see what it can do: "RegEx match open tags except XHTML self-contained tags"

answered 2010-01-03T08:41:42.210
0
/<img(\s+([a-z]{3,})=(["']([^"']*)["']|[^\s>]+))+\s*\/?>/i

A match_all will return (the format depends on your library, but the key indexes are):

0 -> image tag
1 -> attribute
2 -> attribute name
3 -> attribute value (with enclosing quotes, if present)
4 -> attribute value (without enclosing quotes; empty if the value was unquoted, in which case use 3)
answered 2010-01-03T08:57:47.647
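To see how those group indexes behave in practice, here is a small sketch in Python. It uses an assumed adaptation of the pattern: the `/…/i` delimiters become `re.IGNORECASE`, and the unquoted branch is written `[^\s>]+` so it can match values longer than one character. In Python's `re`, a repeated group keeps only its last repetition, so the sketch walks the attributes with a second pattern; the names `IMG_TAG`, `ATTRIBUTE` and `parse_img_tags` are invented for the example. (.NET keeps every repetition in `Group.Captures`, so the second pass may be unnecessary there.)

```python
import re

# Whole-tag pattern (assumed adaptation of the one above).
IMG_TAG = re.compile(
    r"""<img(\s+([a-z]{3,})=(["']([^"']*)["']|[^\s>]+))+\s*/?>""",
    re.IGNORECASE,
)

# The attribute sub-pattern on its own, used to visit every attribute of a tag.
# Groups: 1 = name, 2 = value as written, 3 = value without quotes (None if unquoted).
ATTRIBUTE = re.compile(
    r"""\s+([a-z]{3,})=(["']([^"']*)["']|[^\s>]+)""",
    re.IGNORECASE,
)


def parse_img_tags(html):
    """Return one {name: value} dict per <img> tag the pattern accepts."""
    results = []
    for tag in IMG_TAG.finditer(html):
        attrs = {}
        for attr in ATTRIBUTE.finditer(tag.group(0)):
            name, raw, unquoted = attr.group(1), attr.group(2), attr.group(3)
            # Fall back to the raw value when there were no enclosing quotes.
            attrs[name.lower()] = unquoted if unquoted is not None else raw
        results.append(attrs)
    return results


print(parse_img_tags('<img src="a.png" alt=\'cat\' width=100>'))
```

Note that `[a-z]{3,}` rejects attribute names shorter than three letters or containing a hyphen (`id`, `data-foo`), and a single such attribute makes the whole tag fail to match, so this remains a sketch for well-behaved input rather than a general solution.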
0

If you want all attribute values, might I suggest using the DOM? Something like element.attributes will work well.

If you insist on a regex, /\b\w+="[^"]+"/ should get every attribute written with double quotes.

answered 2008-12-08T17:36:05.470
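A quick way to check what that pattern does and does not pick up is to run it over a deliberately messy tag. The sketch below uses Python's `re` with the pattern's surrounding slashes dropped; the function name `double_quoted_attributes` and the sample tag are invented for the example.

```python
import re

# The pattern from this answer, without regex-literal delimiters.
DOUBLE_QUOTED_ATTR = re.compile(r'\b\w+="[^"]+"')


def double_quoted_attributes(tag):
    """Return every name="value" pair the pattern finds, as raw text."""
    return DOUBLE_QUOTED_ATTR.findall(tag)


tag = '<img src="a.png" alt="A cat" title=\'Tom\' width=100 data-id="7" longdesc="">'
print(double_quoted_attributes(tag))
# ['src="a.png"', 'alt="A cat"', 'id="7"']
```

The single-quoted `title`, the unquoted `width` and the empty `longdesc` are all skipped, and `data-id` is reported as `id` because `\w` does not match the hyphen.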