php - Regex: how to find all A tags that do not contain an IMG tag?

Question

Suppose we have HTML like this. We need to get all <a href=""></a> tags that do not contain an img tag.

<a href="http://domain1.com"><span>Here is link</span></a>
<a href="http://domain2.com" title="">Hello</a>
<a href="http://domain3.com" title=""><img src="" /></a>
<a href="http://domain4" title=""> I'm the image <img src="" /> yeah</a>

I am using this regular expression to find all a-tag links:

preg_match_all("!<a[^>]+href=\"?'?([^ \"'>]+)\"?'?[^>]*>(.*?)</a>!is", $content, $out);

I can modify it like this, so the link body may not contain any tags at all:

preg_match_all("!<a[^>]+href=\"?'?([^ \"'>]+)\"?'?[^>]*>([^<>]+?)</a>!is", $content, $out);

But how can I tell it to exclude only those <a href=""></a> results that contain the substring <img?

score 3 · Accepted Answer

You should use an HTML parser such as Simple DOM parser. Regular expressions cannot reliably parse HTML.
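As an illustration of the parser approach, here is a minimal sketch. It uses PHP's built-in DOMDocument and DOMXPath rather than the Simple DOM parser named above (a substitution on my part), and the variable names and the XPath expression are my own assumptions, not something from the question.

```php
<?php
// Sample input from the question.
$content = implode("\n", [
    '<a href="http://domain1.com"><span>Here is link</span></a>',
    '<a href="http://domain2.com" title="">Hello</a>',
    '<a href="http://domain3.com" title=""><img src="" /></a>',
    '<a href="http://domain4" title=""> I\'m the image <img src="" /> yeah</a>',
]);

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them;
// fragments and imperfect markup often trigger some.
libxml_use_internal_errors(true);
$doc->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select anchors that have an href attribute and no <img> descendant.
$hrefs = [];
foreach ($xpath->query('//a[@href][not(.//img)]') as $a) {
    $hrefs[] = $a->getAttribute('href');
}

print_r($hrefs);
```

The predicate not(.//img) checks descendants at any depth, so an image wrapped in a span inside the link would be excluded as well.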

score 2 · Accepted Answer

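DOM is the way to go, but for interest's sake, here is a regex solution:

The simplest way to exclude certain matches in a regular expression is a "negative lookahead" or "negative lookbehind". A negative lookaround makes the match fail at a position if its sub-expression matches there; with .+ on both sides, it rejects any line that contains the sub-expression.

Example: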

^(?!.+<img.+)<a href=\"?\'?.+\"?\'?>.+</a>$

Matches:

<a href="http://domain1.com"><span>Here is link</span></a>
<a href="http://domain2.com" title="">Hello</a>

But does not match:

<a href="http://domain3.com" title=""><img src="" /></a>
<a href="http://domain4" title=""> I'm the image <img src="" /> yeah</a>
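To check the behaviour listed above in PHP, a minimal sketch could look like this. It assumes each tag sits on its own line and adds the m modifier (my addition, not part of the pattern as posted) so that ^ and $ anchor at every line; the variable names are illustrative.

```php
<?php
// Sample input from the question, joined with "\n" so each tag is one line.
$content = implode("\n", [
    '<a href="http://domain1.com"><span>Here is link</span></a>',
    '<a href="http://domain2.com" title="">Hello</a>',
    '<a href="http://domain3.com" title=""><img src="" /></a>',
    '<a href="http://domain4" title=""> I\'m the image <img src="" /> yeah</a>',
]);

// The pattern from this answer. The "m" modifier lets ^ and $ anchor at
// every line instead of only at the start and end of the whole string.
$pattern = '~^(?!.+<img.+)<a href="?\'?.+"?\'?>.+</a>$~m';

preg_match_all($pattern, $content, $out);

// $out[0] holds the full text of each matching line.
print_r($out[0]);
```

Because . does not match a newline by default, the lookahead only inspects the current line, which is why the two image links are rejected and the other two lines are kept.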

The negative lookahead is this part of the pattern:

(?!.+<img.+)

This says: fail the match if, from this position, there is a run of characters followed by <img and then more characters.

<a href=\"?\'?.+\"?\'?>.+</a>

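The rest is my generic match for an anchor tag in HTML; you may want to substitute a different matching expression.

Depending on your use case, you may need to omit the leading ^ and trailing $ anchors.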
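Without the anchors, a lookahead at the start of the pattern no longer limits itself to a single tag. One possible workaround, sketched here as my own adaptation of the question's pattern rather than anything from this answer, is to apply the lookahead before every character of the link body, so the body can never run across <img. The delimiter is changed to ~ because the pattern now contains !.

```php
<?php
// Sample input from the question.
$content = implode("\n", [
    '<a href="http://domain1.com"><span>Here is link</span></a>',
    '<a href="http://domain2.com" title="">Hello</a>',
    '<a href="http://domain3.com" title=""><img src="" /></a>',
    '<a href="http://domain4" title=""> I\'m the image <img src="" /> yeah</a>',
]);

// The question's pattern with a modified body group: (?:(?!<img).)*?
// consumes one character at a time, and only if "<img" does not start there.
$pattern = '~<a[^>]+href="?\'?([^ "\'>]+)"?\'?[^>]*>((?:(?!<img).)*?)</a>~is';

preg_match_all($pattern, $content, $out);

print_r($out[1]); // captured href values
print_r($out[2]); // captured link bodies
```

This keeps the capture groups of the original pattern, so $out[1] still holds the URLs and $out[2] the link contents. It remains a regex over HTML and will misbehave on nested or malformed markup.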

More information on lookahead and lookbehind:

http://www.codinghorror.com/blog/2005/10/exclude-matches-with-regular-expressions.html
