php - Regex headaches with different links and href delimiters (" and ')

Question

I want to match the following link structures with preg_match_all in PHP.

<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>
<a garbage href='http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters>

I can match the quoted forms with this:

'#<a[^>]*?href=("|\')(.*?)("|\')#is'

Or I can get all three forms, but not if the first two contain spaces:

'#<a[^>]*?href=("|\')?(.*?)[\s\"\'>]#is'

How can I formulate it so that it picks up values delimited by " and ' that may contain spaces, but also properly encoded URLs with no delimiters?

score 1 · Accepted Answer

OK, this seems to work:

'#<a[^>]*?href=((["\'][^\'"]+["\'])|([^"\'\s>]+))#is'

($matches[1] contains the URLs)

The only annoyance is that quoted URLs still carry their quotes, so you have to strip them:

$first = substr($match, 0, 1);
if($first == '"' || $first == "'")
    $match = substr($match, 1, -1);
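
For reference, here is a minimal end-to-end sketch that combines the pattern and the quote-stripping step above, run against the four sample links from the question. The variable names ($html, $pattern, $urls) are only illustrative, and the nowdoc syntax assumes PHP 5.3 or later.

```php
<?php
// Sample input copied from the question.
$html = <<<'HTML'
<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>
<a garbage href='http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters>
HTML;

// Group 1 holds either a quoted value (quotes included) or an unquoted one.
$pattern = '#<a[^>]*?href=((["\'][^\'"]+["\'])|([^"\'\s>]+))#is';
preg_match_all($pattern, $html, $matches);

$urls = array();
foreach ($matches[1] as $match) {
    // Strip the surrounding quotes when the value was quoted.
    $first = substr($match, 0, 1);
    if ($first == '"' || $first == "'") {
        $match = substr($match, 1, -1);
    }
    $urls[] = $match;
}

print_r($urls);
```

Note that this pattern does not check that the closing quote is the same character as the opening one, so a value such as "it's" would be cut short at the apostrophe.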

score 1 · Accepted Answer

Edit: I've edited this so that it works a bit better than what I originally posted.

You almost had it with your second regex:

'#<a[^>]*?href=("|\')?(.*?)[\\1|>]#is'

This returns the following array:

array(3) {
  [0]=>
  array(4) {
    [0]=>
    string(92) "<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>"
    [1]=>
    string(101) "<a garbage href='http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage>"
    [2]=>
    string(94) "<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage>"
    [3]=>
    string(77) "<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters>"
  }
  [1]=>
  array(4) {
    [0]=>
    string(1) """
    [1]=>
    string(1) "'"
    [2]=>
    string(0) ""
    [3]=>
    string(0) ""
  }
  [2]=>
  array(4) {
    [0]=>
    string(74) "http://this.is.a.link.com/?query=this has invalid spaces" possible garbage"
    [1]=>
    string(83) "http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage"
    [2]=>
    string(77) "http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage"
    [3]=>
    string(60) "http://this.is.a.link.com/?query=no_spaces_but_no_delimiters"
  }
}

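It matches with or without delimiters, although, as the dump shows, capture group 2 runs on to the closing > whenever other attributes follow the URL.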
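
A note on why group 2 in the dump above includes the trailing attributes: inside a character class, \1 is not a backreference (PCRE reads it as an octal character escape), so [\\1|>] only ever stops at a literal |, a > or the byte 0x01. The sketch below is one possible variant, not the answer author's code, that moves the backreference outside the class and adds a separate alternative for unquoted values. The names $pattern, $sets and $urls are only illustrative.

```php
<?php
$html = <<<'HTML'
<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>
<a garbage href='http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters>
HTML;

// Group 1: opening quote; group 2: quoted value, ended by the same quote (\1).
// Group 3: unquoted value, ended by whitespace, a quote or ">".
$pattern = '#<a[^>]*?href=(?:(["\'])(.*?)\1|([^\s"\'>]+))#is';
preg_match_all($pattern, $html, $sets, PREG_SET_ORDER);

$urls = array();
foreach ($sets as $m) {
    // When the quoted branch matched, group 1 is non-empty and group 3 is absent.
    $urls[] = ($m[1] !== '') ? $m[2] : $m[3];
}

print_r($urls);
```

Because the quoted value is captured without its quotes, no stripping step is needed afterwards.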

score 1 · Accepted Answer

Use a DOM parser. You can't parse (X)HTML with regular expressions.

$html = <<<END
<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>
<a garbage href='http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters>
END;

$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML($html);
libxml_use_internal_errors(false);

$items = $domd->getElementsByTagName("a");
foreach ($items as $item) {
  var_dump($item->getAttribute("href"));
}

score 0 · Accepted Answer

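When you say you want to match them, are you trying to extract information from the links, or just find hyperlinks that have an href? If you're only after the latter, this should work fine: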

/<a[^>]*href=[^\s].*?>/

score 0 · Accepted Answer

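As @JasonWoof pointed out, you need an embedded alternation: one alternative for quoted URLs and one for unquoted. I'd also suggest using a capturing group to record which quote is in use, as @DanHorrigan did. By adding a negative lookahead ((?!\\2)) and possessive quantifiers (*+), you can create a very robust regex that is also fast: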

~
<a\\s+[^>]*?\\bhref=
(
  (["'])          # capture the opening quote
  (?:(?!\\2).)*+  # anything else, zero or more times
  \\2             # match the closing quote
|
  [^\\s>]*+   # anything but whitespace or closing brackets
)
~ix

See it in action on ideone. (The doubled backslashes are there because the regex is written as a PHP heredoc. I'd prefer to use a nowdoc, but ideone is apparently still running PHP 5.2.)
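
As a rough sketch of how this might look with a nowdoc (assuming PHP 5.3 or later), the same pattern can be written with single backslashes and passed straight to preg_match_all. The surrounding code and the names $html, $pattern, $sets and $urls are illustrative and not taken from the ideone demo.

```php
<?php
$html = <<<'HTML'
<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>
<a garbage href='http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters>
HTML;

// Nowdoc: no escape processing, so backslashes are written once.
$pattern = <<<'REGEX'
~
<a\s+[^>]*?\bhref=
(
  (["'])          # capture the opening quote
  (?:(?!\2).)*+   # anything but that quote, zero or more times
  \2              # match the closing quote
|
  [^\s>]*+        # anything but whitespace or closing brackets
)
~ix
REGEX;

preg_match_all($pattern, $html, $sets, PREG_SET_ORDER);

$urls = array();
foreach ($sets as $m) {
    // Group 2 is only present when the quoted branch matched.
    $quoted = isset($m[2]) && $m[2] !== '';
    // Group 1 still includes the quotes, so drop them for quoted values.
    $urls[] = $quoted ? substr($m[1], 1, -1) : $m[1];
}

print_r($urls);
```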
