haskell - 目录树的广度优先遍历不是懒惰的

Question

我尝试遍历目录树。天真的深度优先遍历似乎不会以惰性方式生成数据并且会耗尽内存。接下来我尝试了一种广度优先的方法，它显示了同样的问题——它使用了所有可用的内存，然后崩溃了。

我的代码是：

getFilePathBreadtFirst :: FilePath -> IO [FilePath]
getFilePathBreadtFirst fp = do
  fileinfo <- getInfo fp 
  res :: [FilePath]  <- if isReadableDirectory fileinfo
          then do
                children  <- getChildren fp 
                lower    <-  mapM getFilePathBreadtFirst children  
                return (children ++  concat lower)
           else return [fp]        -- should only return the files? 
  return res 

getChildren :: FilePath -> IO [FilePath]
getChildren path = do 
          names <- getUsefulContents path
          let namesfull = map (path </>) names
          return namesfull

testBF fn = do  -- crashes for /home/frank, does not go to swap 
  fps <- getFilePathBreadtFirst fn
  putStrLn $ unlines fps

我认为所有代码都是线性或尾递归的，我希望文件名列表立即开始，但实际上并非如此。我的代码和我的想法中的错误在哪里？我在哪里失去了懒惰的评价？

score 7 · Accepted Answer

我将使用三个不同的技巧来解决您的问题。

技巧 1：使用pipes库在遍历树的同时流式传输文件名。
技巧2：使用StateT (Seq FilePath)transformer实现广度优先遍历。
技巧3：MaybeT在编写循环和退出时使用转换器避免手动递归。

以下代码将这三个技巧组合在一个 monad 转换器堆栈中。

import Control.Monad
import Control.Monad.Trans
import Control.Monad.Trans.Maybe
import Control.Monad.State.Lazy
import Control.Pipe
import Data.Sequence
import System.FilePath.Posix
import System.Directory

loop :: (Monad m) => MaybeT m a -> m ()
loop = liftM (maybe () id) . runMaybeT . forever

quit :: (Monad m) => MaybeT m a
quit = mzero

getUsefulContents :: FilePath -> IO [FilePath]
getUsefulContents path
  = fmap (filter (`notElem` [".", ".."])) $ getDirectoryContents path

permissible :: FilePath -> IO Bool
permissible file
  = fmap (\p -> readable p && searchable p) $ getPermissions file

traverseTree :: FilePath -> Producer FilePath IO ()
traverseTree path = (`evalStateT` empty) $ loop $ do
    -- All code past this point uses the following monad transformer stack:
    -- MaybeT (StateT (Seq FilePath) (Producer FilePath IO)) ()
    let liftState = lift
        liftPipe  = lift . lift
        liftIO    = lift . lift . lift
    liftState $ modify (|> path)
    forever $ do
        x <- liftState $ gets viewl
        case x of
            EmptyL    -> quit
            file :< s -> do
                liftState $ put s
                liftPipe $ yield file
                p <- liftIO $ doesDirectoryExist file
                when p $ do
                    names <- liftIO $ getUsefulContents file
                    -- allowedNames <- filterM permissible names
                    let namesfull = map (path </>) names
                    liftState $ forM_ namesfull $ \name -> modify (|> name)

这将创建一个广度优先文件名生成器，可以与树遍历同时使用。您使用以下方式使用这些值：

printer :: (Show a) => Consumer a IO r
printer = forever $ do
    a <- await
    lift $ print a

>>> runPipe $ printer <+< traverseTree path
<Prints file names as it traverses the tree>

您甚至可以选择不要求所有值：

-- Demand only 'n' elements
take' :: (Monad m) => Int -> Pipe a a m ()
take' n = replicateM_ n $ do
    a <- await
    yield a

>> runPipe $ printer <+< take' 3 <+< traverseTree path
<Prints only three files>

更重要的是，最后一个示例只会尽可能多地遍历树以生成三个文件，然后它将停止。当您想要的只是 3 个结果时，这可以防止浪费地遍历整个树！

要了解有关pipes库技巧的更多信息，请参阅管道教程。Control.Pipes.Tutorial

要了解有关循环技巧的更多信息，请阅读此博客文章。

我找不到用于广度优先遍历的队列技巧的良好链接，但我知道它就在某个地方。如果其他人知道一个很好的链接，只需编辑我的答案以添加它。

score 0 · Accepted Answer

我已经将管道的管理和树的遍历分开了。这里首先是管道的代码（基本上是冈萨雷斯的代码 - 谢谢！）：

traverseTree :: FilePath -> Producer FilePath IO ()
-- ^ traverse a tree in breadth first fashion using an external doBF function 
traverseTree path = (`evalStateT` empty) $ loop $ do
-- All code past this point uses the following monad transformer stack:
-- MaybeT (StateT (Seq FilePath) (Producer FilePath IO)) ()
let liftState = lift
    liftPipe  = lift . lift
    liftIO    = lift . lift . lift
liftState $ modify (|> path)
forever $ do
    x <- liftState $ gets viewl
    case x of
        EmptyL    -> quit
        file :< s -> do
            (yieldval, nextInputs) <- liftIO $ doBF file 
            liftState $ put s
            liftPipe $ yield yieldval
            liftState $ forM_ nextInputs $ \name -> modify (|> name)

接下来是树遍历的代码：

doBF :: FilePath -> IO (FilePath, [FilePath])
doBF file = do 
    finfo <- getInfo file
    let p =  isReadableDirectoryNotLink finfo
    namesRes <- if p then do
        names :: [String] <- liftIO $ getUsefulContents file
        let namesSorted = sort names
        let namesfull = map (file </>) namesSorted
        return namesfull
        else return []          
    return (file, namesRes)

我希望把doBF换成类似的函数，先遍历深度。我假设我可以使 traverseTree 更通用，不仅适用于 FilePath ~ String，而且我看不到序列上的空函数在哪个类中。一般可能有用。

haskell - 目录树的广度优先遍历不是懒惰的

2 回答 2

Related

Reference