regex - Remove duplicate links using a regular expression

Question

I'm trying to parse some HTML and remove unnecessary duplicate links. For example, I'd like the following code:

<p>
  Lorem ipsum amet 
  <a href="http://edition.cnn.com/">
    Proin lacinia posuere
  </a>
   sit ipsum.
</p>
<p>
  <a href="http://www.google.com/articles/blah">
    [caption align="alignright"]
    <a href="http://www.google.com/articles/blah">
      <img src="http://hoohlr.dev/Picture-142-300x222.png" alt="Blah blah/Flickr " height="222" class="size-medium wp-image-4351" />
    </a>
     sociis magnis [/caption]
  </a>
</p>

to be converted into this (removing the link before [caption] along with its closing tag):

<p>
  Lorem ipsum amet 
  <a href="http://edition.cnn.com/">
    Proin lacinia posuere
  </a>
   sit ipsum.
</p>
<p>
  [caption align="alignright"]
  <a href="http://www.google.com/articles/blah">
    <img src="http://hoohlr.dev/Picture-142-300x222.png" alt="Blah blah/Flickr " height="222" class="size-medium wp-image-4351" />
  </a>
   sociis magnis [/caption]
</p>

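The link to remove will always be the one immediately before [caption]. Could someone who is good with regular expressions help me do this with PHP's preg_replace (or a simpler approach)?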

I'd really appreciate it. Thanks!

Edit: OK, I've made a decent attempt at what I'm looking for: http://regexr.com?31t05 and http://regexr.com?31svv. I tried to post it as an answer, but the site won't let me... Can anyone improve on it?

score 0 · Accepted Answer

This tested script works on your test data:

<?php // test.php Rev:20120820_2200
function stripNestedAnchorTags($text) {
    $re = '%
        # Match (invalid) outer A element containing inner A element.
        <a\b[^<>]+>\s*               # Outer A element start tag (and ws).
        (                            # $1: contents of outer A element.
          [^<]*(?:<(?!/?a\b)[^<]*)*  # Everything up to inner <a>
          <a\b[^<>]+>                # Inner A element start tag.
          [^<]*(?:<(?!/?a\b)[^<]*)*  # Everything up to inner </a>
          </a>                       # Inner A element end tag.
          [^<]*(?:<(?!/?a\b)[^<]*)*  # Everything up to outer </a>
        )                            # End $1: contents of outer A.
        </a>\s*                      # Outer A element end tag (and ws).
        %ix';
    // Repeat until no anchor-inside-anchor remains (handles several levels of nesting).
    while (preg_match($re, $text)) {
        $text = preg_replace($re, '$1', $text);
    }
    return $text;
}
$idata = file_get_contents('testdata.html');
$odata = stripNestedAnchorTags($idata);
file_put_contents('testdata_out.html', $odata);
?>
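
For trying the approach without test files, here is a minimal sketch that runs the same pattern on an inline string. The function name `stripNestedAnchors` and the shortened sample markup are made up for illustration and are not part of the script above. As a small variation, it loops on the `$count` out-parameter of `preg_replace` rather than calling `preg_match` first, so each pass scans the text only once.

```php
<?php
// Hypothetical helper for illustration; the pattern is the one from the answer above.
function stripNestedAnchors(string $text): string {
    $re = '%
        <a\b[^<>]+>\s*               # Outer A start tag (and ws).
        (                            # $1: contents of outer A element.
          [^<]*(?:<(?!/?a\b)[^<]*)*  # Everything up to inner <a>
          <a\b[^<>]+>                # Inner A start tag.
          [^<]*(?:<(?!/?a\b)[^<]*)*  # Everything up to inner </a>
          </a>                       # Inner A end tag.
          [^<]*(?:<(?!/?a\b)[^<]*)*  # Everything up to outer </a>
        )
        </a>\s*                      # Outer A end tag (and ws).
        %ix';
    do {
        // $count receives the number of replacements made in this pass.
        $result = preg_replace($re, '$1', $text, -1, $count);
        if ($result === null) {
            break; // PCRE error: return the text as it stands.
        }
        $text = $result;
    } while ($count > 0);
    return $text;
}

// Shortened sample input (an assumption, modeled on the question's markup).
$html = '<p><a href="http://www.google.com/articles/blah">'
      . '[caption align="alignright"]'
      . '<a href="http://www.google.com/articles/blah"><img src="pic.png" alt="x" /></a>'
      . ' sociis magnis [/caption]</a></p>';

echo stripNestedAnchors($html), "\n";
```

A link with no anchor nested inside it, such as the first paragraph in the question, does not match the pattern and is left untouched. Note that this unwraps any anchor that contains another anchor, not only those that start with [caption].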
