php - PHP Crawler - Why does the readdir while loop exit when it reaches an empty folder?

Question

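I have the following function as part of a larger program. It crawls the contents of a supplied path and indexes any .htm or .html pages it finds in the parent folder or in any subfolder. My crawler function (below) is recursive and seems to work fine until it reaches a subfolder that contains no items.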

This seems to be a common problem, and it is usually solved by writing the while loop like this:

while ( false !== ($file = readdir($folder)) )

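But this does not work. The last line of output I get is 'The current crawler path is ......', and then the output stops. I am guessing the problem lies with the empty folder and the readdir function, but I don't know how to fix it. Can anyone advise?

Thanks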

function crawlFolders($path)
{
    $prevPath = $path;  // variable to keep track of the previous file path
    chdir($path);
    $folder = opendir($path);

    echo "The current crawler path is ".$path."<br>";

    while ( false !== ($file = readdir($folder)) ) // read current directory item, then advance pointer
    {   
        if ( is_file($file) )
        {   echo "File found!  The crawler is inspecting to see if it can be indexed<br>";
            if ( canIndex($path."/".$file) )
                indexPage($path."/".$file);
        }

        else if ( is_dir($file) ) 
        {
            // it's a folder, we must crawl (skipping the "." and ".." entries)
            if ( ($file != ".") && ($file != "..") )
            {
                echo "$file is a folder<br><br>";
                crawlFolders($path."/".$file);
                chdir($prevPath); // change the working dir back to that of the calling fn

            }
        }   
    }
    closedir($folder);

}

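After looking at this some more, I can't see why readdir would be causing the problem. I think the problem may be that my crawlFolders function is not unwinding, and is instead simply ending when it reaches the deepest empty folder. Am I missing something about how recursion is supposed to work? I was under the impression that once the while loop condition evaluates to false, the recursive call would exit and put me back in the crawlFolders call that made it (i.e. unwinding itself).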

Do I need to return a value every time crawlFolders exits so that the calling function knows where to resume?

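It looks like recursion really is the problem. I placed a file in the empty folder and my indexer processed it, but the function calls still do not unwind the way I expect. I know the unwinding is not happening because two files in the starting path are still never evaluated.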

score 1 · Accepted Answer

The problem is not the recursion; most likely it is the current working directory.

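You change the current directory with chdir() and then pass the relative filename $file to is_file() and is_dir(). After returning from the recursion, the current directory is still the subdirectory, so is_file($file) and is_dir($file) will not find the file.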

You must either save the current directory before entering the recursion or - better - avoid chdir() altogether and work with the full path: is_file($path . '/' . $file)
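As a rough sketch of the second option, the function could look something like the following. This is not the asker's exact code: the $onPage callback is a hypothetical stand-in for the canIndex()/indexPage() pair, and the .htm/.html check by regular expression is an assumption about what "can be indexed" means. The point is only that every filesystem check receives $path . '/' . $file, so the working directory never matters.

```php
<?php
// Sketch only: crawl a tree without ever calling chdir().
// $onPage is a hypothetical callback replacing canIndex()/indexPage().
function crawlFolders(string $path, callable $onPage): void
{
    $folder = opendir($path);
    if ($folder === false) {
        return; // directory could not be opened: skip it
    }

    while (false !== ($file = readdir($folder))) {
        if ($file === '.' || $file === '..') {
            continue; // never recurse into the current or parent directory
        }

        // Build the full path once and use it for every check below.
        $fullPath = $path . '/' . $file;

        if (is_dir($fullPath)) {
            crawlFolders($fullPath, $onPage);
        } elseif (is_file($fullPath) && preg_match('/\.html?$/i', $file)) {
            $onPage($fullPath); // assumed rule: only .htm/.html pages qualify
        }
    }

    closedir($folder);
}
```

With this shape, an empty subfolder simply makes the inner while loop end immediately; the call returns and the caller carries on with its own directory handle. No return value is needed for the caller to resume.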
