php - How to load only HTML (and skip media files)

Question

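I'm optimizing my simple web crawler (currently using PHP/curl_multi).

The goal is to crawl an entire website while being smart about it and skipping non-HTML content. I tried using the nobody option and sending only HEAD requests, but this doesn't seem to work for every site (some servers don't support HEAD), and it causes exec to stall for a long time (sometimes much longer than loading the page itself).

Is there another way to get the page type without downloading the whole content, or to force cURL to abandon the download if the file isn't HTML?

(Writing my own HTTP client is not an option, since I intend to use the cURL functions later for things like cookies and SSL.)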

score 1 · Accepted Answer

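I haven't tried it, but I see there is CURLOPT_PROGRESSFUNCTION. I'd bet you could read the response incrementally, look for the content-type header, and possibly curl_close() the handle if you aren't interested in what is being downloaded.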

CURLOPT_PROGRESSFUNCTION     The name of a callback function
where the callback function takes three parameters. The first is the
cURL resource, the second is a file-descriptor resource, and the 
third is length. Return the string containing the data.

http://www.php.net/manual/en/function.curl-setopt.php
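
An untested sketch of how the progress-callback idea might look. It assumes PHP 5.5 or later, where the callback receives five arguments (the handle plus download/upload byte counts) and a non-zero return value aborts the transfer. It also assumes that curl_getinfo() reports the content type once the headers have been parsed. The names isHtmlType and makeHtmlOnlyHandle are hypothetical, not part of cURL.

```php
<?php
// Hypothetical helper: true when a Content-Type value denotes an HTML-like document.
function isHtmlType($contentType) {
    $parts = explode(';', (string) $contentType);   // drop "; charset=..." and similar
    $type = strtolower(trim($parts[0]));
    return in_array($type, array('text/html', 'application/xhtml+xml'), true);
}

// Hypothetical wrapper: builds a handle that gives up on non-HTML responses.
function makeHtmlOnlyHandle($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_NOPROGRESS, false);     // progress callbacks are off by default
    curl_setopt($ch, CURLOPT_PROGRESSFUNCTION,
        function ($ch, $dlTotal, $dlNow, $ulTotal, $ulNow) {
            $type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
            if (!is_string($type) || $type === '') {
                return 0;                            // type not known yet, keep going
            }
            return isHtmlType($type) ? 0 : 1;        // non-zero aborts the transfer
        });
    return $ch;
}
```

The returned handle could be passed to curl_exec() or added to a curl_multi handle; an aborted transfer would then show up as a failed request rather than as a downloaded body.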

score 1 · Accepted Answer

The correct way is to use

curl_setopt($ch, CURLOPT_HEADERFUNCTION, 'curlHeaderCallback');

The callback accepts 2 parameters: the first is the cURL handle, the second is the header line. It is called each time a new header arrives.

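$acceptable = array('application/xhtml+xml',
    'application/xml', 'text/plain',
    'text/xml', 'text/html');

// Called by cURL once per response header line; must return the number of bytes handled.
function curlHeaderCallback($resURL, $strHeader) {
    global $acceptable;
    if (stripos($strHeader, 'content-type:') === 0) {
        // Take the value after the first colon and drop parameters such as "; charset=utf-8".
        // Intermediate variables avoid passing function results by reference.
        $parts = explode(':', $strHeader, 2);
        $value = explode(';', $parts[1]);
        $type = strtolower(trim($value[0]));
        if (!in_array($type, $acceptable))
            return 0; // a length other than strlen($strHeader) aborts the transfer
    }
    return strlen($strHeader);
}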
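
A minimal sketch of how such a header callback might be wired into a request, assuming a single handle rather than curl_multi. The function names contentTypeFromHeader and fetchIfAcceptable are hypothetical. When the callback returns a length that doesn't match, curl_exec() is expected to return false with curl_errno() equal to CURLE_WRITE_ERROR.

```php
<?php
// Hypothetical helper: extract the bare media type from a raw header line,
// or return null if the line is not a Content-Type header.
function contentTypeFromHeader($headerLine) {
    if (stripos($headerLine, 'content-type:') !== 0) {
        return null;
    }
    $parts = explode(':', $headerLine, 2);
    $value = explode(';', $parts[1]);
    return strtolower(trim($value[0]));
}

// Hypothetical wrapper: returns the body for acceptable types,
// or null if the transfer was aborted or failed for another reason.
function fetchIfAcceptable($url, array $acceptable) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HEADERFUNCTION,
        function ($ch, $header) use ($acceptable) {
            $type = contentTypeFromHeader($header);
            if ($type !== null && !in_array($type, $acceptable, true)) {
                return 0;               // length mismatch makes cURL abort
            }
            return strlen($header);     // accept this header line
        });
    $body = curl_exec($ch);
    return ($body === false) ? null : $body;
}
```

A call such as fetchIfAcceptable($url, array('text/html')) would then yield null for images and other media without reading their bodies.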

score 0 · Accepted Answer

This worked for me:

<?php
$handle = curl_init('http://www.google.com');
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
curl_setopt($handle, CURLOPT_HEADER, true);
$result = curl_exec($handle);
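$type = curl_getinfo($handle, CURLINFO_CONTENT_TYPE);
// Note: curl_exec() above has already downloaded the whole response; this only inspects its type.
// The is_string() guard covers responses that carry no Content-Type header.
if(is_string($type) && strpos($type, 'text/html') !== false) {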
    echo 'The URL is an HTML page.';
}
?>
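
The same curl_getinfo() check could in principle run before the body is read, from inside a write callback. The sketch below is untested and rests on two assumptions: that the content type is already available when the first body chunk arrives, and that returning a byte count different from the chunk length aborts the transfer. HtmlOnlySink and attachSink are hypothetical names.

```php
<?php
// Hypothetical sink: collects body chunks only while the response type is HTML.
class HtmlOnlySink {
    public $body = '';

    // Returns the number of bytes accepted; a mismatch tells cURL to abort.
    public function write($contentType, $chunk) {
        if (!is_string($contentType) || stripos($contentType, 'text/html') === false) {
            return 0;
        }
        $this->body .= $chunk;
        return strlen($chunk);
    }
}

// Hypothetical wiring: route body data of an existing handle through the sink.
function attachSink($ch, HtmlOnlySink $sink) {
    curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, $chunk) use ($sink) {
        return $sink->write(curl_getinfo($ch, CURLINFO_CONTENT_TYPE), $chunk);
    });
}
```

With a write callback installed, the body ends up in $sink->body rather than in the return value of curl_exec().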

score 0 · Accepted Answer

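Have you looked at fsockopen?

You can open a socket to the remote page and then read only what you need. Once the Content-Type header has been determined, you can close the connection.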

<?php
$type = 'Unknown';
$fp = fsockopen("www.example.com", 80, $errno, $errstr, 30);
if (!$fp) {
    echo "$errstr ($errno)<br />\n";
} else {
    $out = "GET / HTTP/1.1\r\n";
    $out .= "Host: www.example.com\r\n";
    $out .= "Connection: Close\r\n\r\n";
    fwrite($fp, $out);

    $in = '';
    while (!feof($fp)) {
        $in .= fgets($fp, 128);
        // Call-time pass-by-reference (&$matches) is a fatal error since PHP 5.4.
        if ( preg_match( '/^Content-Type:\s*([^\r\n]+)\r?\n/mi', $in, $matches ) ) {
            $type = trim($matches[1]);
            break;
        }
    }
    fclose($fp);
}
echo $type;
?>
