php - Are there any Apache configurations to improve web crawling performance?

Question

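I have a PHP web crawler that, when run on localhost, often freezes after a few pages, leaving my web browser showing the loading indicator and nothing else.

I have checked my code, and it may well contain a bug. However, after staring at it for the past few hours, I am ready to explore other possibilities.

While my crawler runs, it dumps information as the different processes start and end. I also call flush() frequently to make sure the browser shows the most recent messages. (This gives the browser a console-like look.)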
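For illustration, the status output described above could look roughly like the sketch below. The helpers formatStatus() and emitStatus() are hypothetical names and are not part of the original crawler; the elapsed-time and memory figures are an added assumption that may make it easier to see how long a stage has been running when the script stalls.

```php
<?php
/**
 * Build one status line with elapsed time and current memory use.
 * formatStatus() is a hypothetical helper, not taken from the original crawler.
 */
function formatStatus(string $message, float $startedAt, float $now): string
{
    $elapsed  = $now - $startedAt;
    $memoryMb = memory_get_usage(true) / 1048576;
    // Escape the message so that text such as "<a>" is shown rather than rendered.
    return sprintf('[%7.2fs | %6.1f MB] %s', $elapsed, $memoryMb, htmlspecialchars($message));
}

/**
 * Print a status line and push it to the browser immediately.
 */
function emitStatus(string $message, float $startedAt): void
{
    echo formatStatus($message, $startedAt, microtime(true)), "<br />\n";
    // Flush PHP's own output buffer first (if one is active), then the server's.
    if (ob_get_level() > 0) {
        ob_flush();
    }
    flush();
}

$startedAt = microtime(true);
emitStatus('Getting <a>....', $startedAt);
```

Whether the browser actually shows each line as it is emitted also depends on buffering or compression in the web server, which this sketch does not control.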

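The reason I am now looking into the Apache configuration is that my program does not always freeze in the same place. Sometimes it freezes while searching for the "a" tags whose URLs are to be added to the queue, and other times it freezes while downloading the XHTML data itself, at this point: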

    private function _getXhtml() {
        $curl = curl_init();
        if ($curl === false) {
            // curl_error() needs a valid handle, so it cannot be called here
            throw new Exception('Unable to init curl.');
        }
        curl_setopt($curl, CURLOPT_URL, $this->_urlCurrent);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
        // Faking user agent
        curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');
        $xhtml = curl_exec($curl);
        if (!$xhtml) {
            $xhtml = null;
            echo 'PROBLEM' . $this->_urlCurrent . '<br />';
            //throw new Exception('Unable to read XHTML. ' . curl_error($curl));
        }
        curl_close($curl);
        return $xhtml;
    }

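Beyond that, I would like to know whether any changes can be made to Apache's configuration files or to php.ini to improve a localhost environment for web scraping.

Any help would be greatly appreciated.

Update:

I believe I have narrowed it down to Zend_Dom_Query. That would explain why my application crashes at different stages (sometimes while it is grabbing an href for the crawl list, and sometimes while it is looking for something in the page to "harvest").

Here is a sample of my output.

Here, the application crashes on the first page, while fetching a URL: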

string(25) "Scraping page number 0..."
string(9) "Mining..." 
string(15) "Getting <a>...." 
string(24) "Finished getting <a>...." 
string(20) "Getting <a href>...." 
string(43) "Mining page http://www.a-site.co.uk/ ..." 
string(17) "New page found..." 
string(18) "Page confirmed...." 
string(29) "Finished Getting <a href>...." 
string(20) "Getting <a href>...." 
string(43) "Mining page http://www.a-site.co.uk/ ..." 
string(29) "Finished Getting <a href>...." 
string(20) "Getting <a href>...."

Here, the application fails while extracting an element:

string(25) "Scraping page number 5..."
string(9) "Mining..."

//This bit loops for around 70 URLS
string(15) "Getting <a>...."
string(24) "Finished getting <a>...."
string(20) "Getting <a href>...."
string(48) "Mining page http://www.a-site.org ..."
string(29) "Finished Getting <a href>...."
//end loop

string(70) "Harvesting http://www.a.site.org/a-url-path/..."
string(19) "Harvesting html element..."

score 0 · Accepted Answer

The script looks fine, so the problem may be the site you are scraping. Try adding:

curl_setopt($curl, CURLOPT_CONNECTTIMEOUT ,5); //timeout in seconds.
curl_setopt($curl, CURLOPT_TIMEOUT, 30); //timeout in seconds.

You could also try making your crawler look more like a real browser. Personally, I use my own headers, which I capture with Fiddler2 and then recreate in curl.
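As a rough sketch of both suggestions together, the options could be collected in one place and applied with curl_setopt_array(). The function names buildCrawlerCurlOptions() and fetchPage(), and the specific header and user-agent values, are assumptions made for illustration; in practice you would substitute the headers captured from a real browser session.

```php
<?php
/**
 * Build a cURL option array with timeouts and browser-like request headers.
 * The function name and the header values are illustrative assumptions.
 */
function buildCrawlerCurlOptions(string $url): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_CONNECTTIMEOUT => 5,   // seconds allowed for connecting
        CURLOPT_TIMEOUT        => 30,  // seconds allowed for the whole transfer
        CURLOPT_ENCODING       => '',  // advertise every encoding cURL supports
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0',
        CURLOPT_HTTPHEADER     => [
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language: en-GB,en;q=0.5',
        ],
    ];
}

/**
 * Fetch a page, returning null on failure instead of hanging indefinitely.
 */
function fetchPage(string $url): ?string
{
    $curl = curl_init();
    if ($curl === false) {
        throw new RuntimeException('Unable to init curl.');
    }
    curl_setopt_array($curl, buildCrawlerCurlOptions($url));
    $body = curl_exec($curl);
    if ($body === false) {
        // curl_errno() tells a timeout apart from other failures.
        error_log(sprintf('Fetch failed for %s: [%d] %s', $url, curl_errno($curl), curl_error($curl)));
        $body = null;
    }
    curl_close($curl);
    return $body;
}
```

Inside the crawler, the body of _getXhtml() could then reduce to a call such as fetchPage($this->_urlCurrent). Logging the cURL error number should show whether a stall is a network timeout or something in the parsing code.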
