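Description
This regular expression will:
- match all anchor tags that have a class attribute of vip
- capture the href attribute value of those anchor tags
- avoid problematic edge cases, such as attribute-like text inside another attribute's quoted value
- allow class and href to appear in any order within the anchor tag
- not capture anchors that appear after the More to explore section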
<a\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\shref=('[^']*'|"[^"]*"|[^'"][^\s>]*))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sclass=['"]?vip['"]?)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?>.*?</a>(?=.*?More\sto\sexplore)
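To illustrate the edge case this expression avoids, here is a minimal sketch that compares a naive pattern with the attribute-aware class lookahead taken from the expression above. The naive pattern, the variable names, and the two-line sample are assumptions made for illustration only; they are not part of the original expression.

```php
<?php
// Hypothetical sample: the first tag only mentions class="vip" inside a
// quoted onmouseover value; the second tag really has class="vip".
$html = <<<'HTML'
<a onmouseover=' var class="vip" ; funClassSwap(class); ' href="http://www.ebay.co.uk/blahblah-22">x</a>
<a class="vip" href="http://www.ebay.co.uk/blahblah-33">more text</a>
HTML;

// Naive pattern (an assumption, for comparison only): any class="vip" text
// before the closing > counts as a hit, even inside another attribute value.
$naive = '/<a\b[^>]*\sclass="vip"[^>]*>/i';

// Class lookahead from the expression above: every ='...' or ="..." value is
// consumed as a single unit, so text inside a quoted value is never inspected.
$aware = '/<a\b(?=\s)(?=(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?\sclass=[\'"]?vip[\'"]?)/i';

$naiveCount = preg_match_all($naive, $html, $m1);
$awareCount = preg_match_all($aware, $html, $m2);

echo "naive: $naiveCount, attribute-aware: $awareCount\n";
// prints "naive: 2, attribute-aware: 1"
```

The naive pattern is fooled by the onmouseover value, while the lookahead only accepts class when it appears as a real attribute of the tag.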
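PHP Code Example:
Sample Text
Note that the second line contains potentially problematic text: class="vip" appears inside the quoted onmouseover value rather than as a real attribute. The last anchor falls after the More to explore section and should not be matched.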
<a href="http://www.ebay.co.uk/blahblah-11" class="vip" title="x" itemprop="name">text here</a>
<a onmouseover=' var class="vip" ; funClassSwap(class); ' href="http://www.ebay.co.uk/blahblah-22"><form><input type="image" src="submit.gif"></form></a>
<a class="vip" href="http://www.ebay.co.uk/blahblah-33" title="x" itemprop="name">more text</a>
<div class="seoi-c">
<h2 class="seoi-h">More to explore</h2>
<div class="fl">
<ul class="tso-u">
<li><a href="http://www.ebay.com/sch/Lathes-/97230/i.html?_dcat=97230&Type=CNC&_trksid=p2045573.m2389" title="Lathes in Metalworking Equipment CNC">Lathes in Metalworking Equipment CNC</a></li>
</ul>
</div>
<div class="fl">
<ul class="tso-u">
</ul>
</div>
</div>
<a class="vip" href="http://www.ebay.co.uk/blahblah-44" title="x" itemprop="name">more text</a>
Code
<?php
$sourcestring="your source string";
preg_match_all('/<a\b(?=\s) # match the start of the opening tag
(?=(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?\shref=(\'[^\']*\'|"[^"]*"|[^\'"][^\s>]*)) # capture the href attribute value in group 1
(?=(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?\sclass=[\'"]?vip[\'"]?) # validate the class attribute
(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"\s]*)*"\s?> # consume the rest of the opening tag
.*?<\/a> # consume the anchor content through the closing tag
(?=.*?More\sto\sexplore) # require that the More to explore section follows this match
/imsx',$sourcestring,$matches);
echo "<pre>".print_r($matches,true)."</pre>";
?>
Matches
With the default PREG_PATTERN_ORDER, index 0 holds the full anchors and index 1 holds the captured href values (quotes included):
[0][0] = <a href="http://www.ebay.co.uk/blahblah-11" class="vip" title="x" itemprop="name">text here</a>
[0][1] = <a class="vip" href="http://www.ebay.co.uk/blahblah-33" title="x" itemprop="name">more text</a>
[1][0] = "http://www.ebay.co.uk/blahblah-11"
[1][1] = "http://www.ebay.co.uk/blahblah-33"
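Because group 1 captures the href value together with its surrounding quotes, a caller will usually want to strip them. The following is a minimal sketch of one way to do that, assuming PREG_SET_ORDER is passed so that each match keeps its anchor and href together; the shortened sample and the $pattern and $urls names are hypothetical.

```php
<?php
// Hypothetical two-line sample; real input would be the full page source.
$sourcestring = <<<'HTML'
<a class="vip" href="http://www.ebay.co.uk/blahblah-33" title="x">more text</a>
<h2 class="seoi-h">More to explore</h2>
HTML;

// Same expression as above, split across lines for readability.
$pattern = '/<a\b(?=\s)'
    . '(?=(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?\shref=(\'[^\']*\'|"[^"]*"|[^\'"][^\s>]*))'
    . '(?=(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?\sclass=[\'"]?vip[\'"]?)'
    . '(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"\s]*)*"\s?>'
    . '.*?<\/a>'
    . '(?=.*?More\sto\sexplore)/is';

// PREG_SET_ORDER groups results per match: $match[0] is the whole anchor,
// $match[1] is the href value including its quotes.
preg_match_all($pattern, $sourcestring, $matches, PREG_SET_ORDER);

$urls = [];
foreach ($matches as $match) {
    // Strip the surrounding single or double quotes from the captured value.
    $urls[] = trim($match[1], '\'"');
}

print_r($urls);
```

With this sample, $urls contains the single entry http://www.ebay.co.uk/blahblah-33 without quotes.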