php - Regex to get part of a URL from an HTML string

Question

I am working with a full HTML document and need to extract URLs, but only those that match a required domain.

<html>
<div id="" class="">junk
<a href="http://example.com/foo/bar">example.com</a>
morejunk
<a href="http://notexample.com/foo/bar">notexample.com</a>
</div>
</html>

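From that junk I need to get the full URL for example.com, but not the other one (notexample.com). That would be "http://example.com/foo/bar" or, even better, only the last part of that URL (bar), which of course will be different each time.

I hope I have been clear enough. Many thanks!

Edit: using PHP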

score 1 · Accepted Answer

Regular expressions are something to avoid when parsing the HTML itself. Here is DOM-parser-based code that does what you need:

$html = <<< EOF
<html>
<div id="" class="">junk
<a href="http://example.com/foo/bar">example.com</a>
morejunk
<a href="http://notexample.com/foo/bar">notexample.com</a>
</div>
</html>
EOF;
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//a[@href]"); // gets all the links that have an href
foreach ($nodelist as $node) {
    $val = $node->getAttribute('href');
    // match only example.com links and capture everything after /foo/
    if (preg_match('#^https?://example\.com/foo/(.*)$#', $val, $m)) {
        echo "$m[1]\n"; // prints bar
    }
}
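
If the path prefix (/foo/ above) is not fixed, the domain check and the "last part" extraction can also be done without a regex. The following is only a minimal sketch under some assumptions: the helper name lastSegmentsForHost is hypothetical (it is not part of the code above), the host must match exactly (so www.example.com would not count), and "last part" is taken to mean the final path segment as returned by basename().

```php
<?php
// Hypothetical helper (an illustration, not part of the original answer):
// returns the last path segment of every link whose host is exactly $host.
function lastSegmentsForHost(string $html, string $host): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $segments = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $parts = parse_url($a->getAttribute('href'));
        if ($parts === false || !isset($parts['host'], $parts['path'])) {
            continue; // relative, malformed, or path-less link
        }
        // Host names are case-insensitive, so compare without regard to case.
        if (strcasecmp($parts['host'], $host) === 0) {
            $segments[] = basename($parts['path']);
        }
    }
    return $segments;
}

$html = '<div><a href="http://example.com/foo/bar">example.com</a>'
      . '<a href="http://notexample.com/foo/bar">notexample.com</a></div>';
print_r(lastSegmentsForHost($html, 'example.com')); // array with one element: bar
```

Comparing the parsed host rather than the raw string keeps notexample.com out without having to anchor a pattern by hand.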
