php - How do I write a bot that doesn't consume too much RAM?

Question

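I have a web bot that consumes far too much memory: after a while memory usage reaches 50% and the process gets killed. I have no idea why memory usage grows like this. I have not included "para.php" here; it is a library for parallel curl requests. I would like to learn more about web crawlers; I have searched a lot but could not find any useful documentation or methods I could use.

This is the library I got para.php from.

My code: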

require_once "para.php";

class crawling{

public $montent;


public function crawl_page($url){

    $m = new Mongo();

    $muun = $m->howto->en->findOne(array("_id" => $url));

    if (isset($muun)) {
        return;
    }

    $m->howto->en->save(array("_id" => $url));

    echo $url;

    echo "\n";

    $para = new ParallelCurl(10);

    $para->startRequest($url, array($this,'on_request_done'));

    $para->finishAllRequests();

    preg_match_all("(<a href=\"(.*)\")siU", $this->montent, $matk);

    foreach($matk[1] as $longu){
        $href = $longu;
        if (0 !== strpos($href, 'http')) {
            $path = '/' . ltrim($href, '/');
            if (extension_loaded('http')) {
                $href = http_build_url($url, array('path' => $path));
            } else {
                $parts = parse_url($url);
                $href = $parts['scheme'] . '://';
                if (isset($parts['user']) && isset($parts['pass'])) {
                    $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                }


                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
                $href .= $path;
            }
        }
        // Recurse on the resolved absolute URL, not the raw (possibly relative) link.
        $this->crawl_page($href);
    }
}

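public function on_request_done($content) {
    // Callback invoked by ParallelCurl; keep the response body for crawl_page.
    $this->montent = $content;
}

} // end of class crawling
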
$moj = new crawling;
$moj->crawl_page("http://www.example.com/");

score 0 · Accepted Answer

You call this crawl_page function on one URL. Its content is fetched ($this->montent) and checked for links ($matk).

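While these have not yet been destroyed, you recurse, starting a new call to crawl_page. $this->montent is overwritten with the new content (that's fine). A bit further down, $matk (a new variable) is filled with the links from the new $this->montent. At this point there are two $matk arrays in memory: one with all the links of the document you started processing first, and one with all the links of the first document linked from that original document. Every further level of recursion adds another array, and none of them can be freed until the calls below it return.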
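To make that concrete, here is a minimal sketch rather than the poster's crawler. `extract_links` is a hypothetical stand-in for "fetch a page and run preg_match_all", and the sizes and depths are arbitrary. It only illustrates that a local array stays allocated for as long as deeper recursive calls are running.

```php
<?php
// Hypothetical stand-in for fetching a page and extracting its links.
function extract_links(int $depth): array {
    $links = [];
    for ($i = 0; $i < 5000; $i++) {
        $links[] = "http://www.example.com/page-$depth-$i";
    }
    return $links;
}

// Shaped like crawl_page: $links (the role of $matk) is still alive
// while the deeper call runs, so the arrays pile up level by level.
function crawl_recursive(int $depth, int $maxDepth, int &$peak): void {
    $links = extract_links($depth);
    $peak = max($peak, memory_get_usage());
    if ($depth < $maxDepth) {
        crawl_recursive($depth + 1, $maxDepth, $peak);
    }
    // $links can only be freed here, after all deeper calls have returned.
}

$base = memory_get_usage();

$shallow = 0;
crawl_recursive(1, 1, $shallow);   // one level: one link array alive

$deep = 0;
crawl_recursive(1, 20, $deep);     // twenty levels: twenty link arrays alive

$shallowGrowth = $shallow - $base;
$deepGrowth    = $deep - $base;
printf("1 level: %d bytes, 20 levels: %d bytes\n", $shallowGrowth, $deepGrowth);
```

In a real crawl the recursion depth is bounded only by how many unseen links exist, so the growth continues until the process is killed.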

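I would suggest finding all the links and saving them to the database (instead of recursing immediately). Then simply work through the queue of links in the database one at a time (each newly fetched document adds its own new entries to the database).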
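The queue idea could look roughly like the sketch below. It rests on several assumptions that are not in the question: an in-memory `SplQueue` and a `$seen` array stand in for the MongoDB collection so the example runs on its own, `$fetch` is a hypothetical callable that returns the HTML for a URL (in the real bot this would be the ParallelCurl request), and `crawl_with_queue` and `$maxPages` are names made up for illustration. Relative links are skipped for brevity.

```php
<?php
function crawl_with_queue(string $startUrl, callable $fetch, int $maxPages = 100): array {
    $queue = new SplQueue();       // stand-in for the "links to visit" collection
    $seen  = [$startUrl => true];  // stand-in for the findOne()/save() check on _id
    $queue->enqueue($startUrl);
    $visited = [];

    // One loop, no recursion: only the current page's links are held at a time.
    while (!$queue->isEmpty() && count($visited) < $maxPages) {
        $url = $queue->dequeue();
        $visited[] = $url;

        $html = $fetch($url);
        preg_match_all('(<a href="(.*)")siU', $html, $matches);

        foreach ($matches[1] as $href) {
            if (strpos($href, 'http') !== 0) {
                continue; // relative links are skipped in this sketch
            }
            if (!isset($seen[$href])) {
                $seen[$href] = true;
                $queue->enqueue($href);
            }
        }
        unset($html, $matches); // release this page's data before the next one
    }
    return $visited;
}

// Usage with a fake three-page site instead of real HTTP requests.
$site = [
    'http://www.example.com/'  => '<a href="http://www.example.com/a">A</a><a href="http://www.example.com/b">B</a>',
    'http://www.example.com/a' => '<a href="http://www.example.com/">home</a>',
    'http://www.example.com/b' => '<a href="http://www.example.com/a">A</a>',
];
$visited = crawl_with_queue(
    'http://www.example.com/',
    function ($url) use ($site) {
        return isset($site[$url]) ? $site[$url] : '';
    }
);
print_r($visited);
```

For a large crawl the queue and the seen-set would themselves grow in memory, which is why the answer suggests keeping them in the database and pulling one pending URL at a time.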
