php - How to minimize the memory consumption of a PHP spider when using curl_multi_getcontent()?

Question

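I was hoping someone could help me with this issue. I'm currently writing a spider function in PHP that recursively crawls across a website (via the links it finds on the site's pages) up to a pre-specified depth.

So far my spider works well for up to 2 levels of depth. My problem starts when the depth is 3 or more levels, especially on larger websites. I get a fatal memory error, which I think has to do with all of the recursive multi-processing using cURL (and also with the fact that going 3 levels down on some sites can mean thousands of URLs to process).

Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 366030 bytes) in C:\xampp\htdocs\crawler.php on line 105

My question is about what I might be doing wrong (or what I ought to be doing) to minimize memory consumption.

Here is what the code currently looks like. The parts that matter for memory usage are left intact, and the more complex processing sections are replaced with pseudocode/comments (to make it easier to read). Thanks!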

<?php

function crawler( $urlArray, $visitedUrlArray, $depth ){

    /* Recursion check 
       --------------- */
    if( empty( $urlArray) || ( $depth < 1 ) ){
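        // Return the visited list unchanged; a bare return would hand null back to the caller
        return $visitedUrlArray;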
    }

    /* Set up Multi-Handler 
       -------------------- */
    $multiCURLHandler = curl_multi_init();      
    $curlHandleArray= array();

    foreach( $urlArray as $url ){
        $curlHandleArray[$url] = curl_init();
        curl_setopt( $curlHandleArray[$url], CURLOPT_URL, $url );
        curl_setopt( $curlHandleArray[$url], CURLOPT_HEADER, 0 );
        curl_setopt( $curlHandleArray[$url], CURLOPT_TIMEOUT, 1000 );
        curl_setopt( $curlHandleArray[$url], CURLOPT_RETURNTRANSFER , 1 );  
        curl_multi_add_handle( $multiCURLHandler, $curlHandleArray[$url] );
    }

    /* Run Multi-Exec 
       -------------- */
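    $running = null;
    do {
        curl_multi_exec( $multiCURLHandler, $running );
        // Wait for activity on the handles instead of spinning the CPU
        if ( $running > 0 ) {
            curl_multi_select( $multiCURLHandler, 1.0 );
        }
    }
    while ( $running > 0 );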


    /* Process URL pages to find links to traverse
       ------------------------------------------- */
    foreach( $curlHandleArray as $key => $curlHandle ){


        /* Grab content from a handle and close it
           --------------------------------------- */
        $urlContent = curl_multi_getcontent( $curlHandle );
        curl_multi_remove_handle( $multiCURLHandler, $curlHandle );
        curl_close( $curlHandle );          


        /* Place content in a DOMDocument for easy link processing
           ------------------------------------------------------- */
        $domDoc = new DOMDocument( '1.0' );
        $success = @$domDoc -> loadHTML( $urlContent );


        /* The Array to hold all the URLs to pass recursively
           -------------------------------------------------- */    
        $recursionURLsArray = array();


        /* Grab all the links from the DOMDocument and add to new URL array
           ---------------------------------------------------------------- */
        $anchors = $domDoc -> getElementsByTagName( 'a' );
        foreach( $anchors as $element ){
            // ---Clean the link
            // ---Check if the link is in $visited
            //    ---If so, continue;
            //    ---If not, add to $recursionURLsArray and $visitedUrlArray
        }


        /* Call the function recursively with the parsed URLs
           -------------------------------------------------- */
        $visitedUrlArray = crawler( $recursionURLsArray, $visitedUrlArray, $depth - 1 );

    }


    /* Close and unset variables
       ------------------------- */
    curl_multi_close( $multiCURLHandler );
    unset( $multiCURLHandler );
    unset( $curlHandleArray );

    return $visitedUrlArray;
}
?>

score 1 · Accepted Answer

This is your problem:

 "I'm currently writing a spider function in PHP that recursively crawls across a website"

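Don't do that. You're going to get yourself into an infinite loop and cause a denial of service. Your real problem is not running out of memory. Your real problem is that you're going to take down the sites you are crawling.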

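Real web spiders don't go hit your site and hit every page boom boom boom like you are doing. The way you are doing it is more like an attack than a legitimate web crawler. They're called "crawlers" because they "crawl", as in "go slow". Plus, a legitimate web crawler will read the robots.txt file and will not read pages that are off limits according to that file.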
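To make the robots.txt point concrete, here is a minimal sketch of checking a path against the Disallow rules that apply to all user agents. The function names `parseDisallowedPaths` and `isPathAllowed` are hypothetical, and this is a deliberate simplification: it only does prefix matching on the `User-agent: *` group and ignores `Allow` lines, wildcards and `Crawl-delay`, so treat it as an illustration rather than a full parser.

```php
<?php
// Hypothetical helper: collects the Disallow path prefixes that apply to
// all user agents ("User-agent: *") from the text of a robots.txt file.
function parseDisallowedPaths(string $robotsTxt): array
{
    $disallowed = [];
    $inWildcardGroup = false;
    $previousWasAgent = false;
    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        // Strip comments and surrounding whitespace.
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            // Consecutive User-agent lines share one group of rules.
            $matches = ($value === '*');
            $inWildcardGroup = $previousWasAgent ? ($inWildcardGroup || $matches) : $matches;
            $previousWasAgent = true;
            continue;
        }
        $previousWasAgent = false;
        if ($field === 'disallow' && $inWildcardGroup && $value !== '') {
            $disallowed[] = $value;
        }
    }
    return $disallowed;
}

// Hypothetical helper: true if $path does not start with a disallowed prefix.
function isPathAllowed(string $path, array $disallowed): bool
{
    foreach ($disallowed as $prefix) {
        if (strncmp($path, $prefix, strlen($prefix)) === 0) {
            return false;
        }
    }
    return true;
}

$rules = parseDisallowedPaths("User-agent: *\nDisallow: /private/\n");
var_dump(isPathAllowed('/private/data.html', $rules)); // bool(false)
var_dump(isPathAllowed('/index.html', $rules));        // bool(true)
```

A crawler would fetch `/robots.txt` once per host, keep the parsed rules, and skip any link for which `isPathAllowed` returns false.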

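You should go about it more like this:

1. Read ONE page and save its links to a database where the URL has a UNIQUE constraint, so you don't get the same link in there more than once. The table should also have a status field to show whether the URL has been read or not.
2. Grab a URL from the database whose status field shows it is unread. Read it, and save the URLs it links to in the database. Update the status field in the database to show that it has been read.

Repeat #2 as needed, but at the pace of a crawl.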
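The two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses SQLite through PDO (the `pdo_sqlite` extension must be available), the table name `urls`, its columns, and all function names are made up for the example, and `$fetchLinks` is a hypothetical callable that downloads one page and returns the URLs it links to.

```php
<?php
// Assumption: SQLite via PDO; table and column names are illustrative only.
// status: 0 = unread, 1 = read.
function openFrontier(string $dsn = 'sqlite::memory:'): PDO
{
    $db = new PDO($dsn);
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $db->exec(
        'CREATE TABLE IF NOT EXISTS urls (
            url    TEXT NOT NULL UNIQUE,
            status INTEGER NOT NULL DEFAULT 0
        )'
    );
    return $db;
}

// The UNIQUE constraint plus INSERT OR IGNORE silently drops duplicate links.
function enqueueUrl(PDO $db, string $url): void
{
    $stmt = $db->prepare('INSERT OR IGNORE INTO urls (url) VALUES (:url)');
    $stmt->execute([':url' => $url]);
}

// Returns one unread URL, or null when nothing is left to crawl.
function nextUnreadUrl(PDO $db): ?string
{
    $url = $db->query('SELECT url FROM urls WHERE status = 0 LIMIT 1')->fetchColumn();
    return $url === false ? null : (string) $url;
}

function markRead(PDO $db, string $url): void
{
    $stmt = $db->prepare('UPDATE urls SET status = 1 WHERE url = :url');
    $stmt->execute([':url' => $url]);
}

// Step 2 in a loop: one page at a time, with a pause between requests.
function crawl(PDO $db, callable $fetchLinks, int $delaySeconds): void
{
    while (($url = nextUnreadUrl($db)) !== null) {
        foreach ($fetchLinks($url) as $link) {
            enqueueUrl($db, $link);
        }
        markRead($db, $url);
        sleep($delaySeconds);
    }
}
```

To use it, seed the table with `enqueueUrl($db, $startUrl)` and call `crawl` with your own page-fetching callable and a delay of your choosing. Because only one page is held in memory at a time and the list of pending URLs lives in the database, memory use stays roughly flat regardless of how many URLs are discovered. A depth limit, if still wanted, could be stored as an extra column.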

From http://en.wikipedia.org/wiki/Web_crawler#Politeness_policy:

Anecdotal evidence from access logs shows that access intervals from known crawlers vary between 20 seconds and 3-4 minutes.
