php - Need help parsing HTML code

Question

I am parsing HTML code and got stuck. I hope someone can help me. For the detailed code, see this link: http://regexr.com?369sg

I want to match any of the following:

<td class="weekday reservation alternate fixwidth calday fixwidth " > ? </td><!--1-->
<td class="weekend reservation alternate fixwidth calday fixwidth " > ? </td><!--2-->
<td class="weekday calday fixwidth">&nbsp;</td><!--3-->
<td class="weekend calday fixwidth">&nbsp;</td><!--4-->

If I use this pattern:

/<td class="(weekday|weekend) reservation (primary|alternate) fixwidth calday fixwidth " >(.*?)<\/td>/

I only get numbers 1 and 2. If I use this pattern:

/<td class="(weekday|weekend) calday fixwidth">(.*?)<\/td>/

I only get numbers 3 and 4.

How can I match all of the above (1, 2, 3, 4) with a single pattern? I am using the preg_match_all function.

Please help me, thanks.

score 0 · Accepted Answer

Apart from the fact that you would be better off using an HTML parser, here is a regex that does the job:

/<td class="(weekday|weekend) (?:reservation (primary|alternate) fixwidth )?calday fixwidth ?" ?>(.*?)<\/td>/
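
A minimal sketch of how a combined pattern like this could be run through preg_match_all, assuming the four sample cells from the question are held in a string. The variable names ($html, $pattern, $count) are illustrative, and the sketch tolerates an optional space before and after the closing quote because the sample cells differ there.

```php
<?php
// Sample cells copied from the question; $html is an illustrative name.
$html = <<<'HTML'
<td class="weekday reservation alternate fixwidth calday fixwidth " > ? </td><!--1-->
<td class="weekend reservation alternate fixwidth calday fixwidth " > ? </td><!--2-->
<td class="weekday calday fixwidth">&nbsp;</td><!--3-->
<td class="weekend calday fixwidth">&nbsp;</td><!--4-->
HTML;

// The optional group (?: ... )? covers "reservation (primary|alternate) fixwidth ".
// Each " ?" allows the single space that only cells 1 and 2 have around the closing quote.
$pattern = '/<td class="(weekday|weekend) (?:reservation (primary|alternate) fixwidth )?calday fixwidth ?" ?>(.*?)<\/td>/';

// PREG_SET_ORDER groups the results per match instead of per capture group.
$count = preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1]: weekday/weekend, $m[2]: primary/alternate (empty string when absent), $m[3]: cell content.
    printf("%s | %s | %s\n", $m[1], $m[2], trim($m[3]));
}
```

With PREG_SET_ORDER each element of $matches describes one cell, which tends to be easier to loop over than the default per-group layout.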

score 0 · Accepted Answer

I believe I'm required by Stack Overflow to say something bad about using regular expressions to scrape HTML:

DON'T use regular expressions as a standalone parser.
DO use regular expressions if you're just trying to find some strings within some text and the features of the language do not matter.

So here's a new regular expression:

<td.+?class="(?:weekday|weekend)(?:\s+reservation\s+(?:primary|alternate)\s+fixwidth\s+|\s+)calday fixwidth.*?"[^>]*>(.+?)</td>

<td.+?class=: This will allow you to have anything in between <td and class. So if you have other attributes you'll be cool. Do note that lazy quantifiers like +? have a performance penalty. So don't do this a million times.
(?:weekday|weekend): Pretty much the same as you had before, except it is a non-capturing group. I use non-capturing groups so that $matches[1] will hold the content you're looking for.
(?:\s+reservation\s+(?:primary|alternate)\s+fixwidth\s+|\s+): This will match either the class names in the first two examples, or just whitespace for the last two. I considered just using .+?; if those classes aren't important, do that instead.
calday fixwidth.*?": This allows for any additional classes.
"[^>]*>: This allows for more attributes, and it performs better than .*?.
(.+?)</td>: Captures the cell content up to the first closing </td>. End of pattern.
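
The breakdown above can be tried out with a short sketch like the following. It assumes ~ as the pattern delimiter (so the / in </td> needs no escaping) and adds one hypothetical cell with extra attributes and an extra class, which is not in the question, to exercise the <td.+?class=, .*?" and [^>]*> parts.

```php
<?php
// Cells 1 and 3 come from the question; the last cell is a hypothetical
// example with extra attributes and an extra class name.
$html = <<<'HTML'
<td class="weekday reservation alternate fixwidth calday fixwidth " > ? </td>
<td class="weekday calday fixwidth">&nbsp;</td>
<td id="d5" class="weekday calday fixwidth extra" title="t">x</td>
HTML;

// "~" is used as the delimiter so that "</td>" can be written unescaped.
$pattern = '~<td.+?class="(?:weekday|weekend)(?:\s+reservation\s+(?:primary|alternate)\s+fixwidth\s+|\s+)calday fixwidth.*?"[^>]*>(.+?)</td>~';

// Default PREG_PATTERN_ORDER: $matches[0] holds full matches,
// $matches[1] holds the only capturing group (the cell content).
$count = preg_match_all($pattern, $html, $matches);

foreach ($matches[1] as $content) {
    echo trim($content), "\n";
}
```

Because every other group is non-capturing, the cell content is the only capture and always lands in $matches[1].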

One note: this will fail if you have nested matches, and you will need to use a parser instead:

<td class="weekday calday fixwidth">
   <table><tr>
      <td class="weekday calday fixwidth">&nbsp;</td>
   </tr></table>
</td>

With the s modifier (so that . also matches newlines), the result would be a single match:

<td class="weekday calday fixwidth">
   <table><tr>
      <td class="weekday calday fixwidth">&nbsp;</td>

And group 1 would be:

  <table><tr>
      <td class="weekday calday fixwidth">&nbsp;
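
For the nested case, a parser-based approach could look like the sketch below. It is only an illustration under a few assumptions: PHP's bundled dom extension is enabled, the fragment is wrapped in an outer table so that it is valid HTML, and the names $html, $doc, $xpath, $query, $cells and $innermost are made up for this example.

```php
<?php
// Nested markup from the example above, wrapped in an outer table
// (an assumption made here so the fragment is valid HTML).
$html = <<<'HTML'
<table><tr>
<td class="weekday calday fixwidth">
   <table><tr>
      <td class="weekday calday fixwidth">&nbsp;</td>
   </tr></table>
</td>
</tr></table>
HTML;

// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select every td whose class list contains the whole word "calday".
$query = '//td[contains(concat(" ", normalize-space(@class), " "), " calday ")]';
$cells = $xpath->query($query);

// Restrict to cells that do not themselves contain another td.
$innermost = $xpath->query($query . '[not(.//td)]');

printf("calday cells: %d, innermost: %d\n", $cells->length, $innermost->length);
```

Unlike the regex, the parser sees the outer and the inner cell as two separate nodes, so the nesting no longer truncates the match.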

Alternative

Rather than such a specific pattern, I would try this more flexible alternative:

<td.+?class="(?:[^"]*(?:weekday|weekend|primary|alternate|calday|fixwidth)){3,}[^"]*"[^>]*>(.+?)</td>

This uses a repeated non-capturing group with the {3,} quantifier to match a td whose class attribute contains at least three occurrences of the names in the alternation.
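
A small sketch of how this alternative behaves, again assuming ~ as the delimiter. The first cell is hypothetical (not from the question) and contains only one of the listed class names, so it should be skipped; the other two cells are samples 1 and 3 from the question.

```php
<?php
// First cell is a hypothetical non-matching example; the rest are from the question.
$html = <<<'HTML'
<td class="weekday other">no</td>
<td class="weekday reservation alternate fixwidth calday fixwidth " > ? </td>
<td class="weekday calday fixwidth">&nbsp;</td>
HTML;

// The group (?:[^"]*(?:name|...)){3,} must repeat at least three times inside
// the class value; [^"]* keeps each repetition from running past the closing quote.
$pattern = '~<td.+?class="(?:[^"]*(?:weekday|weekend|primary|alternate|calday|fixwidth)){3,}[^"]*"[^>]*>(.+?)</td>~';

$count = preg_match_all($pattern, $html, $matches);

foreach ($matches[1] as $content) {
    echo trim($content), "\n";
}
```

Note that the names are matched as plain substrings, so a class such as "notfixwidth" would also count toward the three; whether that matters depends on the markup being scraped.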
