php - PHP - parse_url: get only pages

Question

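I'm currently working on a small web crawler as a side project. Basically, it collects all the hrefs on a page and then parses them. Here is my problem.

How can I get only the actual page results? At the moment I'm using the following: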

foreach($page->getElementsByTagName('a') as $link) 
{
    $compare_url = parse_url($link->getAttribute('href'));
    if (@$compare_url['host'] == "") 
    { 
        $links[] = 'http://'.@$base_url['host'].'/'.$link->getAttribute('href');
    }
    elseif ( @$base_url['host'] == @$compare_url['host'] ) 
    {
            $links[] = $link->getAttribute('href');
    }   

 }

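As you can see, this pulls in jpeg files, exe files, and so on. I only need to pick up web pages such as .php, .html, .asp, etc.

I'm not sure whether there is a function that can handle this, or whether it needs a regular expression built from some kind of master list?

Thanks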

score 1 · Accepted Answer

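Since a URL string by itself is not tied in any way to the resource behind it, you have to go out and ask the web server about it. For this there is an HTTP method called HEAD, so you don't have to download everything.

You can do this in PHP with curl, like this: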

// Declared at top level: a function nested inside is_html() would be
// redeclared (fatal error) the second time is_html() is called.
function curl_head($url) {
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_NOBODY, true);   // HEAD request, no body
    curl_setopt($curl, CURLOPT_HEADER, true);
    curl_setopt($curl, CURLOPT_MAXREDIRS, 5);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
    $content = curl_exec($curl);
    curl_close($curl);

    // curl_exec() returns false when the request fails
    if ($content === false) {
        return '';
    }

    // redirected heads just pile up one after another
    $parts = explode("\r\n\r\n", trim($content));

    // return only the last one
    return end($parts);
}

function is_html($url) {
    $header = curl_head($url);
    // look for the content-type part of the header response
    return preg_match('/content-type\s*:\s*text\/html/i', $header) === 1;
}

var_dump(is_html('http://github.com'));

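This version only accepts text/html responses and does not check whether the response is a 404 or some other error (it does follow redirects, up to 5 hops). You can adjust the regular expression, add some error handling based on the curl response, or match the first line of the header string (the status line).

Note: the web server will run the scripts behind these URLs to give you a response. Be careful not to overload the host with probing, and not to fetch "delete" or "unsubscribe" type links.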
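As a rough sketch of the "match the first line" idea, the check below works on a raw header block such as the one curl_head() returns. The function name header_is_html_ok is hypothetical, and treating only 2xx statuses as acceptable is an assumption, not something the answer prescribes.

```php
<?php
// Hypothetical helper: given one raw response header block, report whether
// it describes a successful (2xx) text/html response.
function header_is_html_ok(string $header): bool
{
    // Split into lines; the first one is the status line.
    $lines = preg_split('/\r\n|\n|\r/', trim($header));

    // Status line looks like "HTTP/1.1 200 OK" or "HTTP/2 200".
    if (!preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#i', $lines[0], $m)) {
        return false;
    }

    // Assumption: anything outside 200-299 is not a usable page.
    $status = (int) $m[1];
    if ($status < 200 || $status >= 300) {
        return false;
    }

    // Look for a Content-Type header that starts with text/html.
    foreach (array_slice($lines, 1) as $line) {
        if (preg_match('/^content-type\s*:\s*text\/html\b/i', $line)) {
            return true;
        }
    }
    return false;
}

var_dump(header_is_html_ok("HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=utf-8"));
var_dump(header_is_html_ok("HTTP/1.1 404 Not Found\r\nContent-Type: text/html"));
```

Because it only inspects a string, this part can be tried out without making any network requests.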

score 0 · Accepted Answer

To check whether a page is valid (html, php ... extension), use this function:

function check($url) {
    $extensions = array("php", "html"); // Add extensions here
    foreach ($extensions as $ext) {
        // true when the URL ends with ".<ext>"
        if (substr($url, -(strlen($ext) + 1)) == "." . $ext) {
            return 1;
        }
    }
    return 0;
}

foreach ($page->getElementsByTagName('a') as $link) {
    $compare_url = parse_url($link->getAttribute('href'));
    if (@$compare_url['host'] == "") {
        if (check($link->getAttribute('href'))) {
            $links[] = 'http://'.@$base_url['host'].'/'.$link->getAttribute('href');
        }
    }
    elseif (@$base_url['host'] == @$compare_url['host']) {
        if (check($link->getAttribute('href'))) {
            $links[] = $link->getAttribute('href');
        }
    }
}
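One thing to watch with a suffix check on the whole URL is that a query string or fragment (for example index.php?id=3) hides the extension. A possible variant, sketched here under the assumption that only the path component should be inspected, uses parse_url and pathinfo. The name has_page_extension and the default extension list are illustrative only.

```php
<?php
// Hypothetical variant of an extension check: looks only at the URL path,
// so query strings and fragments do not interfere.
function has_page_extension(string $url, array $extensions = ['php', 'html', 'asp']): bool
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return false; // malformed URL or no path component
    }

    // pathinfo() yields "" when the path has no extension.
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return in_array($ext, $extensions, true);
}

var_dump(has_page_extension('index.php?id=3'));
var_dump(has_page_extension('http://example.com/photo.jpg'));
```

Note that URLs without any extension (such as http://example.com/about) are rejected by this sketch, even though they often serve HTML.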

score 0 · Accepted Answer

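Consider using preg_match to check the type of the link (application, image, html file) and decide what to do based on the result.

Another option (also simple) is to use explode and look at the last string in the URL, the part after the final "." (the extension), for example: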
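A minimal sketch of what such a preg_match check might look like is given here. The function name link_type, the category labels, and the extension lists are all assumptions chosen for illustration and are far from exhaustive.

```php
<?php
// Hypothetical classifier: maps a URL to a coarse type based on the
// extension of its path. The lists below are illustrative only.
function link_type(string $url): string
{
    // parse_url() may return null or false; cast to get an empty string.
    $path = (string) parse_url($url, PHP_URL_PATH);

    if (preg_match('/\.(jpe?g|gif|png)$/i', $path)) {
        return 'image';
    }
    if (preg_match('/\.(exe|zip|pdf)$/i', $path)) {
        return 'application';
    }
    if (preg_match('/\.(html?|php|asp)$/i', $path)) {
        return 'html';
    }
    return 'unknown';
}

var_dump(link_type('http://example.com/logo.JPEG'));
var_dump(link_type('/about.html?x=1'));
```

The crawler loop could then keep a link only when link_type() returns 'html', or also when it returns 'unknown' if extensionless URLs should be followed.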

// If the URL has any one of the following extensions, ignore it.
$forbid_ext = array('jpg','gif','exe');

foreach($page->getElementsByTagName('a') as $link) {
    $compare_url = parse_url($link->getAttribute('href'));
    if (@$compare_url['host'] == "")
    { 
           if(check_link_type($link->getAttribute('href')))
           $links[] = 'http://'.@$base_url['host'].'/'.$link->getAttribute('href');
    }
    elseif ( @$base_url['host'] == @$compare_url['host'] )
    {
           if(check_link_type($link->getAttribute('href')))
            $links[] = $link->getAttribute('href');
    }   

    }

function check_link_type($url)
{
   global $forbid_ext;

   // end() takes its argument by reference, so it needs a variable
   $parts = explode("." , $url);
   $ext = end($parts);
   if(in_array($ext , $forbid_ext))
     return false;
   return true;
}

Update (instead of checking for "forbidden" extensions, let's look for the good ones)

$good_ext = array('html','php','asp');
function check_link_type($url)
{
   global $good_ext;

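   // end() takes its argument by reference, so it needs a variable
   $parts = explode("." , $url);
   $ext = end($parts);
   // keep the link only when its extension is on the "good" list
   return in_array($ext , $good_ext);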
}
