
I am parsing HTML code and got stuck. I hope someone can help me. For the detailed code, please see this link: http://regexr.com?369sg

I want to match any of these:

<td class="weekday reservation alternate fixwidth calday fixwidth " > ? </td><!--1-->
<td class="weekend reservation alternate fixwidth calday fixwidth " > ? </td><!--2-->
<td class="weekday calday fixwidth">&nbsp;</td><!--3-->
<td class="weekend calday fixwidth">&nbsp;</td><!--4-->

If I use this pattern:

/<td class="(weekday|weekend) reservation (primary|alternate) fixwidth calday fixwidth " >(.*?)<\/td>/

I only get numbers 1 and 2. If I use this pattern:

/<td class="(weekday|weekend) calday fixwidth">(.*?)<\/td>/

I only get numbers 3 and 4.

How can I match all of the rows above (1, 2, 3, 4) with a single pattern? By the way, I am using the preg_match_all function.

Please help me, thanks.


2 Answers


Apart from the fact that you would be better off using an HTML parser, here is a regex that does the job:

/<td class="(weekday|weekend) (?:reservation (primary|alternate) fixwidth )?calday fixwidth\s*"\s*>(.*?)<\/td>/
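As a quick check, a pattern of this shape can be run against the four rows from the question. The sketch below uses Python's `re` module purely for illustration (the question itself uses PHP's preg_match_all), and it assumes the closing part is written as `calday fixwidth\s*"\s*>`, because rows 3 and 4 have no space before the closing quote or before `>`.

```python
import re

# The four sample rows from the question, numbered as in the original.
html = '''<td class="weekday reservation alternate fixwidth calday fixwidth " > ? </td><!--1-->
<td class="weekend reservation alternate fixwidth calday fixwidth " > ? </td><!--2-->
<td class="weekday calday fixwidth">&nbsp;</td><!--3-->
<td class="weekend calday fixwidth">&nbsp;</td><!--4-->'''

# The reservation classes sit in an optional non-capturing group;
# \s* absorbs the spaces that only rows 1 and 2 have around the closing quote.
pattern = re.compile(
    r'<td class="(weekday|weekend) '
    r'(?:reservation (primary|alternate) fixwidth )?'
    r'calday fixwidth\s*"\s*>(.*?)</td>'
)

# One tuple per matched row: (day type, reservation type or None, cell contents).
rows = [m.groups() for m in pattern.finditer(html)]
for row in rows:
    print(row)
```

For rows 3 and 4 the second group does not participate in the match, so it comes back as `None` here.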
answered 2013-09-11T11:58:36.627

I believe I'm required by Stack Overflow to say something bad about using regular expressions to scrape HTML:

  • DON'T use regular expressions as a standalone parser.
  • DO use regular expressions if you're just trying to find some strings within some text and the features of the language do not matter.

So here's a new regular expression:

<td.+?class="(?:weekday|weekend)(?:\s+reservation\s+(?:primary|alternate)\s+fixwidth\s+|\s+)calday fixwidth.*?"[^>]*>(.+?)</td>


  • <td.+?class=: This allows anything between <td and class, so other attributes before class are fine. Do note that lazy quantifiers like +? have a performance penalty, so don't do this a million times.
  • (?:weekday|weekend): Pretty much the same as what you had before, except that it is a non-capturing group. I use non-capturing groups so that $matches[1] holds the content you're looking for.
  • (?:\s+reservation\s+(?:primary|alternate)\s+fixwidth\s+|\s+): This matches either the extra classes in the first two examples or just whitespace for the last two. I considered just using .+?; if those classes aren't important, do that instead.
  • calday fixwidth.*?": This allows for any additional classes.
  • "[^>]*>: This allows for more attributes, and it performs better than .*?.
  • (.+?)</td>: Captures the cell contents as group 1, up to the first </td>. End of pattern.
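The pieces above can be tried out together. This is a minimal sketch in Python's `re` module rather than PHP, chosen only for illustration; the fifth row, with an `id`, an extra class and a `title`, is a hypothetical addition that is not in the question and is there to exercise the parts that tolerate other attributes and classes.

```python
import re

# Rows 1-4 come from the question; row 5 is hypothetical.
html = '''<td class="weekday reservation alternate fixwidth calday fixwidth " > ? </td><!--1-->
<td class="weekend reservation alternate fixwidth calday fixwidth " > ? </td><!--2-->
<td class="weekday calday fixwidth">&nbsp;</td><!--3-->
<td class="weekend calday fixwidth">&nbsp;</td><!--4-->
<td id="d5" class="weekday calday fixwidth extra" title="t">free</td><!--5-->'''

pattern = re.compile(
    r'<td.+?class="(?:weekday|weekend)'
    r'(?:\s+reservation\s+(?:primary|alternate)\s+fixwidth\s+|\s+)'
    r'calday fixwidth.*?"[^>]*>(.+?)</td>'
)

# With a single capturing group, findall returns just the cell contents,
# which corresponds to $matches[1] after preg_match_all.
cells = pattern.findall(html)
print(cells)
```

In PHP the same pattern needs delimiters; with `/` as the delimiter, the slash in `</td>` has to be escaped as `<\/td>`.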

One note: this will fail if you have nested matches, in which case you will need to use a parser instead:

<td class="weekday calday fixwidth">
   <table><tr>
      <td class="weekday calday fixwidth">&nbsp;</td>
   </tr></table>
</td>

With the s (dot-all) modifier set, so that . also matches line breaks, the result would have one match:

    <td class="weekday calday fixwidth">
   <table><tr>
      <td class="weekday calday fixwidth">&nbsp;</td>

And group 1 would be:

  <table><tr>
      <td class="weekday calday fixwidth">&nbsp;
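To illustrate the parser route for the nested case, here is a rough sketch using Python's standard-library `html.parser` (in PHP, the DOM extension plays the same role). The class name `CaldayCells` and its attributes are hypothetical, and the rule of collecting only each cell's direct text is an assumption made for this example.

```python
from html.parser import HTMLParser


class CaldayCells(HTMLParser):
    """Collect the direct text of every <td> whose class list contains 'calday'."""

    def __init__(self):
        # Keep entities such as &nbsp; as literal text instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.open_cells = []  # one entry per open <td>: list of text parts, or None
        self.cells = []       # direct text of each finished calday cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            classes = (dict(attrs).get("class") or "").split()
            self.open_cells.append([] if "calday" in classes else None)

    def handle_data(self, data):
        # Text belongs to the innermost open <td>, if that cell is tracked.
        if self.open_cells and self.open_cells[-1] is not None:
            self.open_cells[-1].append(data)

    def handle_entityref(self, name):
        self.handle_data("&" + name + ";")

    def handle_endtag(self, tag):
        if tag == "td" and self.open_cells:
            parts = self.open_cells.pop()
            if parts is not None:
                self.cells.append("".join(parts).strip())


nested = '''<td class="weekday calday fixwidth">
   <table><tr>
      <td class="weekday calday fixwidth">&nbsp;</td>
   </tr></table>
</td>'''

parser = CaldayCells()
parser.feed(nested)
parser.close()
print(parser.cells)
```

Both cells are found separately: the inner one finishes first with `&nbsp;`, and the outer one has no direct text of its own once surrounding whitespace is stripped.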

Alternative

Rather than such a specific pattern, I would try this more flexible alternative:

<td.+?class="(?:[^"]*(?:weekday|weekend|primary|alternate|calday|fixwidth)){3,}[^"]*"[^>]*>(.+?)</td>


This uses a repeated non-capturing group, (?:...){3,}, to match a td whose class attribute contains at least three occurrences of the names in the alternation.
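A small experiment shows how loose this alternative is. It is sketched in Python's `re` module for illustration only, and the last two rows are hypothetical additions that are not part of the question.

```python
import re

# Rows 1-4 come from the question; the last two rows are hypothetical.
html = '''<td class="weekday reservation alternate fixwidth calday fixwidth " > ? </td><!--1-->
<td class="weekend reservation alternate fixwidth calday fixwidth " > ? </td><!--2-->
<td class="weekday calday fixwidth">&nbsp;</td><!--3-->
<td class="weekend calday fixwidth">&nbsp;</td><!--4-->
<td class="weekday other">skip</td>
<td class="fixwidth fixwidth fixwidth">loose</td>'''

# {3,} requires at least three occurrences of any listed name inside the
# class value; [^"]* keeps the search from running past the closing quote.
pattern = re.compile(
    r'<td.+?class="(?:[^"]*(?:weekday|weekend|primary|alternate|calday|fixwidth)){3,}'
    r'[^"]*"[^>]*>(.+?)</td>'
)

cells = pattern.findall(html)
print(cells)
```

The row with only one listed name is skipped, while a class value that merely repeats `fixwidth` three times still matches, since the pattern counts occurrences rather than distinct classes.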

answered 2013-09-11T11:58:44.700