I believe I'm required by Stack Overflow to say something bad about using regular expressions to scrape HTML:
- DON'T use regular expressions as a standalone parser.
- DO use regular expressions if you're just trying to find some strings within some text and the features of the language do not matter.
So here's a new regular expression:
<td.+?class="(?:weekday|weekend)(?:\s+reservation\s+(?:primary|alternate)\s+fixwidth\s+|\s+)calday fixwidth.*?"[^>]*>(.+?)</td>
`<td.+?class=`: This allows anything between `<td` and `class`, so if you have other attributes you'll be cool. Do note that lazy quantifiers like `+?` have a performance penalty, so don't do this a million times.
`(?:weekday|weekend)`: Pretty much the same as what you had before, except it is a non-capturing group. I use non-capturing groups so that `$matches[1]` will have the code you're looking for.
`(?:\s+reservation\s+(?:primary|alternate)\s+fixwidth\s+|\s+)`: This will match either the string in the first two examples, or just whitespace for the last two. I considered just doing `.+?`; if those classes aren't important, do that instead.
`calday fixwidth.*?"`: This allows for any additional classes.
`"[^>]*>`: This allows for more attributes, but it performs better than `.*?`.
`(.+?)</td>`: End of pattern. The capturing group holds the contents of the cell.
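Here's a rough sketch of the pattern in use. It is written with Python's `re` module purely for illustration (the `$matches[1]` above suggests PHP, where group 1 plays the same role). The sample markup is made up, and `re.DOTALL` is my assumption so that `.` can also span newlines inside a cell.

```python
import re

# Pattern copied from the answer above, split across lines for readability.
PATTERN = re.compile(
    r'<td.+?class="(?:weekday|weekend)'
    r'(?:\s+reservation\s+(?:primary|alternate)\s+fixwidth\s+|\s+)'
    r'calday fixwidth.*?"[^>]*>(.+?)</td>',
    re.DOTALL,  # assumption: let "." match newlines too
)

# Hypothetical sample cells, invented for this example.
html = (
    '<td class="weekday reservation primary fixwidth calday fixwidth">12</td>'
    '<td class="weekend calday fixwidth today" id="d13">13</td>'
    '<td class="other">skip me</td>'
)

# With a single capturing group, findall returns the text of group 1 per match.
cells = PATTERN.findall(html)
print(cells)  # ['12', '13']
```

Because every other group is non-capturing, the cell contents are always group 1, no matter which branch of the alternation matched.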
One note: this will fail if you have nested matches, in which case you will need to use a parser instead. For example:
<td class="weekday calday fixwidth">
<table><tr>
<td class="weekday calday fixwidth"> </td>
</tr></table>
</td>
With the `s` modifier set (so that `.` also matches newlines), the result would have one match:
<td class="weekday calday fixwidth">
<table><tr>
<td class="weekday calday fixwidth"> </td>
And group 1 would be:
<table><tr>
<td class="weekday calday fixwidth">
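To see the nested case concretely, here's a small sketch in Python (an illustration only, not the code from the question). It runs the pattern with `re.DOTALL` against the nested markup, then walks the same markup with the standard library's `html.parser`. The `CalendarCells` class and its class-name check are hypothetical names I made up; a real scraper would probably use a full DOM or XPath library.

```python
import re
from html.parser import HTMLParser

PATTERN = re.compile(
    r'<td.+?class="(?:weekday|weekend)'
    r'(?:\s+reservation\s+(?:primary|alternate)\s+fixwidth\s+|\s+)'
    r'calday fixwidth.*?"[^>]*>(.+?)</td>',
    re.DOTALL,  # assumption: "." must cross newlines for this input
)

nested = (
    '<td class="weekday calday fixwidth">\n'
    '<table><tr>\n'
    '<td class="weekday calday fixwidth"> </td>\n'
    '</tr></table>\n'
    '</td>'
)

# The regex stops at the first </td>, so the outer cell is cut short.
matches = [m.group(0) for m in PATTERN.finditer(nested)]
print(len(matches))  # 1


class CalendarCells(HTMLParser):
    """Hypothetical helper: records every calendar <td> and its nesting depth."""

    def __init__(self):
        super().__init__()
        self.open_tds = 0  # number of <td> elements currently open
        self.found = []    # (depth, class list) for each matching cell

    def handle_starttag(self, tag, attrs):
        if tag != "td":
            return
        classes = (dict(attrs).get("class") or "").split()
        if "calday" in classes and {"weekday", "weekend"} & set(classes):
            self.found.append((self.open_tds, classes))
        self.open_tds += 1

    def handle_endtag(self, tag):
        if tag == "td" and self.open_tds:
            self.open_tds -= 1


parser = CalendarCells()
parser.feed(nested)
parser.close()
depths = [depth for depth, _ in parser.found]
print(depths)  # [0, 1]
```

The parser sees both cells and knows which one is nested inside the other, which the regular expression cannot track.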
Alternative
Rather than such a specific pattern, I would try this more flexible alternative:
<td.+?class="(?:[^"]*(?:weekday|weekend|primary|alternate|calday|fixwidth)){3,}[^"]*"[^>]*>(.+?)</td>
This uses a repeated non-capturing group to match a `td` that has a `class` attribute containing at least three instances of the names in the alternation.
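A quick sketch of how the alternative behaves, again in Python's `re` for illustration only. The sample cells are invented, and `re.DOTALL` is my assumption.

```python
import re

# Alternative pattern copied from the answer above.
ALT = re.compile(
    r'<td.+?class="(?:[^"]*(?:weekday|weekend|primary|alternate|calday|fixwidth)){3,}'
    r'[^"]*"[^>]*>(.+?)</td>',
    re.DOTALL,  # assumption: let "." match newlines too
)

# Hypothetical sample cells.
samples = [
    '<td class="weekday calday fixwidth">1</td>',                        # three names
    '<td class="fixwidth weekend reservation alternate calday">2</td>',  # four names, any order
    '<td class="weekday other">3</td>',                                  # only one name
    '<td class="fixwidth fixwidth fixwidth">4</td>',                     # one name, three times
]

results = [ALT.findall(s) for s in samples]
print(results)  # [['1'], ['2'], [], ['4']]
```

Note the last sample: the pattern counts instances, not distinct class names, so a repeated name still satisfies `{3,}`. If that matters for your markup, stick with the stricter pattern or a parser.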