For what it's worth, this is the regex you are looking for:
Raw match pattern:
<a ((?:(?!href).)*?)href=[\"\']https:\/\/((?:(?!my-domain.de).)*?)[\"\'](.*?)>(.*?)<\/a>
Raw replace pattern:
<a $1href="http://$2"$3>$4</a>
The PHP code is:
$content = preg_replace('/<a ((?:(?!href).)*?)href=[\"\']https:\/\/((?:(?!my-domain.de).)*?)[\"\'](.*?)>(.*?)<\/a>/i','<a $1href="http://$2"$3>$4</a>',$content);
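That being said, be forewarned: per Andy Lester, this regex is not bulletproof. In my opinion, though, the issue isn't quite "the nature of HTML", or at least it isn't that simple. The point made in that admittedly great resource, http://htmlparsing.com/regexes, is that you are trying to reinvent the wheel on a very bumpy road. The broader concern is "not that regular expressions are evil, per se, but that overuse of regular expressions is evil." That quote is from Jeff Atwood, from an exceptional exposition on the joys and terrors of regular expressions: Regular Expressions: Now You Have Two Problems. (He also has an article specifically warning against parsing HTML with regular expressions: Parsing Html The Cthulhu Way.)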
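As a rough illustration of the parser-based route those articles recommend, here is a minimal sketch using PHP's bundled DOM extension. The function name `downgradeExternalLinks` and the `$ownDomain` parameter are hypothetical, not part of the original answer. It also assumes "own domain" means the host equals `my-domain.de` or is a subdomain of it, which is slightly stricter than the regex (the regex skips any URL that merely contains that text). Character-encoding handling is left out.

```php
<?php
// Hypothetical helper (an assumption, not from the original answer):
// switch https:// to http:// on links whose host is NOT $ownDomain
// or one of its subdomains.
function downgradeExternalLinks(string $html, string $ownDomain): string
{
    $doc = new DOMDocument();

    // Wrap the fragment in a <div> so we can find it again after parsing,
    // and silence libxml warnings about imperfect real-world markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $suffix = '.' . strtolower($ownDomain);
    foreach ($doc->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        if (stripos($href, 'https://') !== 0) {
            continue; // only https links are of interest
        }
        $host  = strtolower((string) parse_url($href, PHP_URL_HOST));
        $isOwn = $host === strtolower($ownDomain)
            || substr($host, -strlen($suffix)) === $suffix;
        if (!$isOwn) {
            $link->setAttribute('href', 'http://' . substr($href, strlen('https://')));
        }
    }

    // Serialize only the children of our wrapper <div>.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Usage: attributes spread over several lines are not a problem for a parser.
echo downgradeExternalLinks(
    "<a title=\"mytitle\"\nhref=\"https://www.other-domain.de/path/index.html\">other domain</a>",
    'my-domain.de'
);
```

The parser may normalize whitespace and quoting when it serializes the markup, so the output is not guaranteed to be byte-for-byte identical to the input outside the changed attribute.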
For instance, specifically in the case of my "solution" above, the following input (with line returns) will not match, despite being valid HTML:
<a title="mytitle"
href="https://www.other-domain.de/path/index.html" 
target="_blank">other domain</a>
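The reason is that `.` does not match a newline by default. One possible workaround, offered only as a sketch, is to add PCRE's `s` (DOTALL) modifier so that `.` also matches line breaks. The variable names below are illustrative. Be aware that with `s` the lazy groups can stretch across lines, so this widens what the pattern may match and does not make it robust.

```php
<?php
// The multi-line anchor from the example above.
$html = "<a title=\"mytitle\"\n"
      . "href=\"https://www.other-domain.de/path/index.html\" \n"
      . "target=\"_blank\">other domain</a>";

$replacement = '<a $1href="http://$2"$3>$4</a>';

// Original pattern: only the "i" flag, so "." stops at newlines.
$original = '/<a ((?:(?!href).)*?)href=[\"\']https:\/\/((?:(?!my-domain.de).)*?)[\"\'](.*?)>(.*?)<\/a>/i';

// Same pattern with the "s" flag added, so "." also matches "\n".
$dotAll = '/<a ((?:(?!href).)*?)href=[\"\']https:\/\/((?:(?!my-domain.de).)*?)[\"\'](.*?)>(.*?)<\/a>/is';

$unchanged = preg_replace($original, $replacement, $html); // no match, input returned as-is
$fixed     = preg_replace($dotAll, $replacement, $html);   // https:// becomes http://

echo $fixed, "\n";
```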
However, the following input is handled as desired:
<a href="https://my-domain.de">my domain</a>
<a href="https://other-domain.de">other domain</a>
<a href="https://www.my-domain.de/path/index.html">my domain</a>
<a href="https://www.other-domain.de/path/index.html">other domain</a>
<a title="other title" href="https://www.my-domain.de/path/index.html" target="_blank">other domain</a>
<a title="my title" href="https://www.other-domain.de/path/index.html" target="_blank">my domain</a>
Becomes:
<a href="https://my-domain.de">my domain</a>
<a href="http://other-domain.de">other domain</a>
<a href="https://www.my-domain.de/path/index.html">my domain</a>
<a href="http://www.other-domain.de/path/index.html">other domain</a>
<a title="other title" href="https://www.my-domain.de/path/index.html" target="_blank">other domain</a>
<a title="my title" href="http://www.other-domain.de/path/index.html" target="_blank">my domain</a>
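To reproduce this transformation locally, a small script along these lines should do. It uses the `preg_replace` call from above unchanged; the `$lines`, `$content` and `$expected` variables are just scaffolding for the demonstration.

```php
<?php
// The six sample anchors from above, one per line.
$lines = [
    '<a href="https://my-domain.de">my domain</a>',
    '<a href="https://other-domain.de">other domain</a>',
    '<a href="https://www.my-domain.de/path/index.html">my domain</a>',
    '<a href="https://www.other-domain.de/path/index.html">other domain</a>',
    '<a title="other title" href="https://www.my-domain.de/path/index.html" target="_blank">other domain</a>',
    '<a title="my title" href="https://www.other-domain.de/path/index.html" target="_blank">my domain</a>',
];
$content = implode("\n", $lines);

// Same call as in the answer: links to my-domain.de are left alone,
// every other https link is rewritten to http.
$content = preg_replace('/<a ((?:(?!href).)*?)href=[\"\']https:\/\/((?:(?!my-domain.de).)*?)[\"\'](.*?)>(.*?)<\/a>/i','<a $1href="http://$2"$3>$4</a>',$content);

// What the answer says the result should be.
$expected = implode("\n", [
    '<a href="https://my-domain.de">my domain</a>',
    '<a href="http://other-domain.de">other domain</a>',
    '<a href="https://www.my-domain.de/path/index.html">my domain</a>',
    '<a href="http://www.other-domain.de/path/index.html">other domain</a>',
    '<a title="other title" href="https://www.my-domain.de/path/index.html" target="_blank">other domain</a>',
    '<a title="my title" href="http://www.other-domain.de/path/index.html" target="_blank">my domain</a>',
]);

echo $content, "\n";
var_dump($content === $expected);
```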
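A great resource for getting a full breakdown of a regex is here: http://www.myregextester.com/index.php
To replicate the test on that tool:
- Select the "Replace" operation
- Put your regex into "Match Pattern"
- Put the replacement into "Replace Pattern"
- Check the "i" flag checkbox
- Check the "Explain" checkbox
- Check the "PHP" checkbox
- Put your target content into "Source Text"
- Click "Submit"
For convenience and posterity, I have included the full explanation provided by that tool below, but two of the conceptual highlights are:
Match pattern explanation: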
The regular expression:
`(?i-msx:<a ((?:(?!href).)*?)href=[\"\']https:\/\/((?:(?!my-domain.de).)*?)[\"\'](.*?)>(.*?)<\/a>)`
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?i-msx:                 group, but do not capture (case-insensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  <a                       '<a '
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
        href                     'href'
----------------------------------------------------------------------
      )                        end of look-ahead
----------------------------------------------------------------------
      .                        any character except \n
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  href=                    'href='
----------------------------------------------------------------------
  [\"\']                   any character of: '\"', '\''
----------------------------------------------------------------------
  https:                   'https:'
----------------------------------------------------------------------
  \/                       '/'
----------------------------------------------------------------------
  \/                       '/'
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
        my-domain                'my-domain'
----------------------------------------------------------------------
        .                        any character except \n
----------------------------------------------------------------------
        de                       'de'
----------------------------------------------------------------------
      )                        end of look-ahead
----------------------------------------------------------------------
      .                        any character except \n
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  [\"\']                   any character of: '\"', '\''
----------------------------------------------------------------------
  (                        group and capture to \3:
----------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
----------------------------------------------------------------------
  )                        end of \3
----------------------------------------------------------------------
  >                        '>'
----------------------------------------------------------------------
  (                        group and capture to \4:
----------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
----------------------------------------------------------------------
  )                        end of \4
----------------------------------------------------------------------
  <                        '<'
----------------------------------------------------------------------
  \/                       '/'
----------------------------------------------------------------------
  a>                       'a>'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
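To see the four capture groups from this breakdown in practice, the following sketch runs `preg_match` on two of the sample links and inspects what each group holds. The variable names are illustrative. One detail visible in the breakdown: the `.` in `my-domain.de` is unescaped, so it matches any character; writing `my-domain\.de` would be stricter.

```php
<?php
$pattern = '/<a ((?:(?!href).)*?)href=[\"\']https:\/\/((?:(?!my-domain.de).)*?)[\"\'](.*?)>(.*?)<\/a>/i';

// A link to another domain: the pattern matches and fills all four groups.
$external = '<a title="my title" href="https://www.other-domain.de/path/index.html" target="_blank">my domain</a>';
$matched  = preg_match($pattern, $external, $m);

// $m[1]: attributes before href (stopped by the (?!href) look-ahead)
// $m[2]: URL after "https://"   (stopped by the (?!my-domain.de) look-ahead or the closing quote)
// $m[3]: attributes after href
// $m[4]: link text
print_r($m);

// A link to my-domain.de: the second look-ahead blocks the URL group
// before it can reach the closing quote, so there is no match at all.
$own       = '<a href="https://www.my-domain.de/path/index.html">my domain</a>';
$ownResult = preg_match($pattern, $own);
var_dump($ownResult); // int(0)
```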