I want to collect all the URLs in an HTML page into an array. For example, given:
This is a webpage <a href="http://somesite.com/link1.php">with</a>
different kinds of <a href="http://somesite.com/link1.php"><img src="someimg.png"></a>
The output I want is:
with => http://somesite.com/link1.php
What I currently get is:
<img src="someimg.png"> => http://somesite.com/link1.php
with => http://somesite.com/link1.php
I don't want the URLs/links that have an image between the opening and closing <a> tags, only the ones that contain text.
My current code is:
<?php
// Returns the inner HTML of a node by serializing each of its children.
function innerHTML($node) {
    $ret = '';
    // Use a separate loop variable so the $node parameter is not overwritten.
    foreach ($node->childNodes as $child) {
        $ret .= $child->ownerDocument->saveHTML($child);
    }
    return $ret;
}

$html = file_get_contents('http://somesite.com/' . $_GET['apt']);
$dom = new DOMDocument;
@$dom->loadHTML($html); // @ suppresses the warnings raised by malformed HTML
$links = $dom->getElementsByTagName('a');
$result = array();
foreach ($links as $link) {
    //$node = $link->nodeValue;
    $node = innerHTML($link);
    $href = $link->getAttribute('href');
    // Only links whose href ends in .pdf are kept.
    if (preg_match('/\.pdf$/i', $href)) {
        $result[$node] = $href;
    }
}
print_r($result);
?>
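For reference, here is a minimal sketch of the behaviour I am after, not a tested solution. It skips any <a> element that contains an <img> and keys the result by the link's visible text. The function name textLinks and the inline sample string are hypothetical and only there for illustration. The sketch leaves out the .pdf filter and the remote fetch from my code, and it uses libxml_use_internal_errors() in place of the @ operator.

```php
<?php
// Hypothetical helper: returns [link text => href] for every <a> that
// contains no <img> element and has non-empty text.
function textLinks(string $html): array
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        // Skip anchors that wrap an image at any depth.
        if ($link->getElementsByTagName('img')->length > 0) {
            continue;
        }
        // textContent holds only the text, without any markup.
        $text = trim($link->textContent);
        if ($text === '') {
            continue;
        }
        $result[$text] = $link->getAttribute('href');
    }
    return $result;
}

// Sample input taken from the example at the top of the question.
$sample = 'This is a webpage <a href="http://somesite.com/link1.php">with</a> '
        . 'different kinds of <a href="http://somesite.com/link1.php"><img src="someimg.png"></a>';
print_r(textLinks($sample));
```

Because the array is keyed by link text, two links with the same text would overwrite each other. Collecting pairs such as [$text, $href] would avoid that if duplicates matter.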