performance - 按广度优先顺序列出目录的所有内容会导致效率低下

Question

我编写了一个 Haskell 模块来按广度优先顺序列出目录的所有内容。下面是源代码。

module DirElements (dirElem) where

import System.Directory (getDirectoryContents, doesDirectoryExist)
import System.FilePath ((</>))

dirElem :: FilePath -> IO [[FilePath]]
dirElem dirPath = iterateM (not.null) (concatMapM getDirectoryContents') [dirPath] >>= return.tail

getDirectoryContents' :: FilePath -> IO [FilePath]
getDirectoryContents' dirPath = do
  isDir <- do doesDirectoryExist dirPath
  if isDir then dirContent else return [] where
    dirContent = do
      contents <- getDirectoryContents dirPath
      return.(map (dirPath</>)).tail.tail $ contents

iterateM :: (Monad m) => (a -> Bool) -> (a -> m a) -> a -> m [a]
iterateM fb f x = do --Notice: Due to the the implementation of >>=, iterateM can't be writen like iterate which gives a infinite list and have type of iterateM :: (Monad m) => (a -> Bool) -> (a -> m a) -> a -> m [a]
  if fb x
    then do
      tail <- do {fx <- f x; iterateM fb f fx}
      return (x:tail)
    else return []

concatMapM :: Monad m => (a -> m[b]) -> [a] -> m[b]
concatMapM f list = mapM f list >>= return.concat

它工作正常，但在大目录上执行时，它会“暂停”一会儿，并弹出所有结果。

经过研究，我发现这是同一个问题，sequence $ map return [1..]::[[Int]]请参阅Why the Haskell sequence function can't be lazy or why recursive monadic functions can't be lazy

score 12 · Accepted Answer

这每隔一段时间就会出现一次，答案最终是使用类似库的迭代器。最近最常被推荐的是代理库。

Haskell中目录的流式递归下降
较旧的管道解决方案已过时且不可迭代，如解决方案广度优先遍历目录树并不懒惰

我以前见过 Conduit 解决方案和一些优雅的单子解决方案，但我现在找不到它们。

score 9 · Accepted Answer

首先，这与严格无关。像许多 monad 一样，IO 实际上在其 monadic 操作中是非严格的。这与惰性 I/O 与急切 I/O 有关。

问题是您首先进行目录遍历，然后处理结果。您可以通过使用协程交错来改善这一点。一种简单的方法是使目录遍历将回调作为参数：

getDirectoryContents' :: (MonadIO m) => (FilePath -> m a) -> FilePath -> m ()
getDirectoryContents' k fp = {- ... -}

这是最简单且最不灵活的解决方案。更灵活的解决方案是实际实现协程。您可以使用free、monad-coroutine或operation来滚动您自己的协程 monad ，或者您可以使用许多流式抽象中的一种，例如管道、枚举器或管道，最后一个是我个人对像这样的简单案例的个人建议。

score 6 · Accepted Answer

我修改了 Davorak 链接到的旧答案以使用新pipes库。

它用于StateP保留未遍历目录的队列，以便可以进行广度优先遍历。为了方便，它MaybeP用于退出循环。

import Control.Monad
import Control.Proxy
import Control.Proxy.Trans.Maybe
import Control.Proxy.Trans.State as S
import Data.Sequence hiding (filter)
import System.FilePath.Posix
import System.Directory

getUsefulContents :: FilePath -> IO [FilePath]
getUsefulContents path
  = fmap (filter (`notElem` [".", ".."])) $ getDirectoryContents path

traverseTree
    :: (Proxy p)
    => FilePath
    -> () -> Producer (MaybeP (StateP (Seq FilePath) p)) FilePath IO r
traverseTree path () = do
    liftP $ S.modify (|> path)
    forever $ do
        x <- liftP $ S.gets viewl
        case x of
            EmptyL    -> mzero
            file :< s -> do
                liftP $ S.put s
                respond file
                p <- lift $ doesDirectoryExist file
                when p $ do
                    names <- lift $ getUsefulContents file
                    let namesfull = map (file </>) names
                    liftP $ forM_ namesfull $ \name ->
                        S.modify (|> name)

这定义了一个广度优先的惰性文件生产者。如果将它连接到打印阶段，它将在遍历树时打印出文件：

main = runProxy $ evalStateK empty $ runMaybeK $
    traverseTree "/tmp" >-> putStrLnD

懒惰意味着如果你只需要 3 个文件，它只会遍历树尽可能多地生成三个文件，然后它会停止：

    main = runProxy $ evalStateK empty $ runMaybeK $
        traverseTree "/tmp" >-> takeB_ 3 >-> putStrLnD

如果您想了解有关该pipes库的更多信息，那么我建议您阅读该教程。

score 3 · Accepted Answer

每个人都在告诉您使用迭代器或管道等，这是当前流行的方法。但是还有另一种经典的方法来做到这一点！只需使用unsafeInterleaveIOfrom System.IO.Unsafe。类型的所有功能IO a -> IO a都是修改 IO 操作，以便它仅在需要值 thunk 时才实际执行 IO，这正是您所要求的。您可以使用它轻松地编写iterateM具有所需语义的 an。

像这样的例子是unsafeInterleaveIO闪耀的地方。

但是，我敢肯定，您已经注意到名称中的“不安全” - 还有其他示例，您希望直接控制文件句柄和资源使用等，这unsafeInterleaveIO确实是坏消息，甚至可能引入违反参考透明度。

（有关更多讨论，请参阅此答案：UnsafeInterleaveIO 何时不安全？）

但同样，在这样的情况下，我认为这unsafeInterleaveIO是明显、正确和直接的结果。

performance - 按广度优先顺序列出目录的所有内容会导致效率低下

4 回答 4

Related

Reference