Keeping in mind that parsing HTML with regular expressions is not the best approach, you can use this more convenient solution:
$pattern = <<<'LOD'
~
(?: # open a non-capturing group
<a\s # beginning of the a tag
(?: # open a non capturing group
[^h>]+ # all characters but "h" and ">" one or more times
| # OR
\Bh+ # one or more "h" not preceded by a word boundary
| # OR
h(?!ref\b) # "h" not followed by "ref"
)*+ # repeat the group zero or more times
href\s*=\s*"[^?]+\? # href with the beginning of the link until the "?"
\K # reset the start of the match (what precedes is not needed in the result)
| # OR
\G(?!\A) # a contiguous match
) # close the non-capturing group
(?: # open a non capturing group
(?<key>[^=&"]++) # take the key (never past the closing double quote)
= # until the "="
(?<value>[^&"]++) # take the value
(?: & | (?=") ) # a "&", or a position followed by a double quote
| # OR
"[^>]*> # a double quote and the end of the opening tag
(?<content> # open the content named capturing group
(?: # open a non capturing group
[^<]+ # all characters but "<" one or more times
| # OR
<(?!/a\b) # a "<" not followed by "/a" (the closing a tag)
)*+ # repeat the group zero or more times
) # close the named capturing group
</a> # the closing tag (can be removed)
) # close the non-capturing group
~xi
LOD;
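This pattern allows several things:
It doesn't care about the order or the number of attributes in the a tag
It doesn't care about the number of key/value pairs (it takes them all)
It ignores tags whose URL contains no key/value pairs
It allows whitespace around the equals sign ( href = " )
It supports HTML tags inside the content part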
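The two less common tokens in the pattern, `\K` and `\G(?!\A)`, are easier to follow in isolation. Here is a reduced sketch of my own, assuming a made-up string that holds only the URL part: the first branch skips up to the `?` and resets the match start, and every later match has to begin exactly where the previous one ended.

```php
<?php
// Hypothetical sample string (not from the question): a URL followed by the closing quote.
$subject = 'page.php?var1=10&var2=20&var3=30"';

// Branch 1: skip everything up to the "?" and drop it from the match with \K.
// Branch 2: \G(?!\A) only succeeds where the previous match ended (never at the very start).
$pattern = '~(?:^[^?]+\?\K|\G(?!\A))(?<key>[^=&"]++)=(?<value>[^&"]++)(?:&|(?="))~';

// PREG_SET_ORDER gives one array per match, which is convenient for chained matches.
preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);

$pairs = array();
foreach ($matches as $m) {
    $pairs[$m['key']] = $m['value'];
}
print_r($pairs);
```

Each key/value pair is one match, so the number of pairs does not have to be known in advance. That is what lets the full pattern take them all.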
Extracting the results from the full pattern, however, takes a bit more work:
preg_match_all($pattern, $subject, $matches);
$result = array();
$keyval = array();
foreach ($matches['key'] as $k => $v) {
    if ($v === '') {
        // an empty key marks the match that closes the tag and holds its content
        $result[] = array('values'  => $keyval,
                          'content' => $matches['content'][$k]);
        $keyval = array();
    } else {
        $keyval[] = array($v => $matches['value'][$k]);
    }
}
print_r($result);
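To make the shape of the result concrete, here is an end-to-end sketch on a made-up input. The assumptions are mine: the `$subject` string is hypothetical, the pattern is a condensed, comment-free copy of the one above (with `"` excluded from the key class so that a key cannot run past the end of the attribute), and I collect the pairs into one flat key => value array per link instead of a list of single-pair arrays.

```php
<?php
// Hypothetical input, made up for this illustration.
$subject = '<p><a class="nav" href="page.php?var1=10&var2=20">Go <b>here</b></a></p>';

// Condensed copy of the commented pattern (no x flag, so no whitespace or comments).
$pattern = '~(?:<a\s(?:[^h>]+|\Bh+|h(?!ref\b))*+href\s*=\s*"[^?]+\?\K|\G(?!\A))'
         . '(?:(?<key>[^=&"]++)=(?<value>[^&"]++)(?:&|(?="))'
         . '|"[^>]*>(?<content>(?:[^<]+|<(?!/a\b))*+)</a>)~i';

$result = array();
$params = array();

// One array per match: key/value matches first, then the match holding the content.
preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);
foreach ($matches as $m) {
    if ($m['key'] !== '') {
        // a key/value match: remember the pair for the current link
        $params[$m['key']] = $m['value'];
    } else {
        // the closing match: store the collected pairs with the link content
        $result[] = array('values' => $params, 'content' => $m['content']);
        $params = array();
    }
}
print_r($result);
```

For this input, `$result` holds a single entry whose `values` are `var1 => 10` and `var2 => 20` and whose `content` is `Go <b>here</b>`. Note that the pattern works on the raw source, so a URL written with `&amp;` instead of `&` would need extra handling.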
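The DOM way
The main advantage of this approach is that a DOM parser behaves much like a browser (which is also a parser): it doesn't care about the number or position of attributes, about whether values use single quotes, double quotes, or no quotes at all, or about the type of content between the tags.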
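A quick way to check that claim is the small sketch below. The markup is hypothetical and written by me for the illustration: three links use double quotes, single quotes, and no quotes, with the attributes in different positions. I also use `libxml_use_internal_errors()` rather than `@` to keep parser warnings quiet, which is only a matter of preference.

```php
<?php
// Hypothetical markup: the same kind of link written with three quoting styles.
$html = '<a class="x" href="page.php?var1=1">double</a>'
      . "<a href='page.php?var1=2' id=second>single</a>"
      . '<a href=page.php?var1=3>none</a>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$hrefs = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    // getAttribute() returns the value regardless of how it was quoted
    $hrefs[] = $a->getAttribute('href');
}
print_r($hrefs);
```

With that in place, the extraction itself looks like this: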
$doc = new DOMDocument();
@$doc->loadHTML($yourhtml); // "@" silences warnings about malformed markup
$linkNodeList = $doc->getElementsByTagName("a");
$result = array();
foreach ($linkNodeList as $linkNode) {
    if (preg_match('~var1=(?<var1>\d+)&var2=(?<var2>\d+)&var3=(?<var3>\d+)~i',
                   $linkNode->getAttribute('href'), $match)) {
        // keep only the named groups: drop the numeric duplicates
        foreach (array_keys($match) as $k) {
            if (is_int($k)) unset($match[$k]);
        }
        // take the content between "a" tags
        $content = '';
        $children = $linkNode->childNodes;
        foreach ($children as $child) {
            $content .= $child->ownerDocument->saveXML($child);
        }
        $result[] = array('values' => $match, 'content' => $content);
    }
}
print_r($result);
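The regex inside the DOM version expects exactly `var1`, `var2` and `var3`, in that order. If you want the same "any number of key/value pairs" behaviour as the big pattern, one possible variant (my own sketch, not part of the code above) is to let `parse_url()` and `parse_str()` split the query string. The input below is hypothetical.

```php
<?php
// Hypothetical input; note the &amp; entities, which the DOM parser decodes to "&".
$yourhtml = '<p><a href="page.php?var1=10&amp;var2=20&amp;extra=abc" class="nav">Go <b>here</b></a>'
          . '<a href="plain.html">no query</a></p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($yourhtml);
libxml_clear_errors();

$result = array();
foreach ($doc->getElementsByTagName('a') as $linkNode) {
    // parse_url() returns null when the URL has no query string
    $query = parse_url($linkNode->getAttribute('href'), PHP_URL_QUERY);
    if (!is_string($query) || $query === '') {
        continue; // skip links without key/value pairs
    }
    parse_str($query, $values); // fills $values with key => value pairs

    // take the content between "a" tags, as in the code above
    $content = '';
    foreach ($linkNode->childNodes as $child) {
        $content .= $doc->saveXML($child);
    }
    $result[] = array('values' => $values, 'content' => $content);
}
print_r($result);
```

Keep in mind that `parse_str()` URL-decodes the values and replaces dots and spaces in key names with underscores, so it is not a byte-for-byte view of the query string.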