haskell - 使用 Iteratee 库编写“wc -l” - 如何过滤换行符？

Question

我正在尝试使用 Haskell Iteratee 库提出等效的“wc -l”。下面是“wc”的代码（它只计算单词 - 类似于 hackage 上的 iteratee 示例中的代码），并且运行速度非常快：

{-# LANGUAGE BangPatterns #-}
import Data.Iteratee as I
import Data.ListLike as LL
import Data.Iteratee.IO
import Data.ByteString


length1 :: (Monad m, Num a, LL.ListLike s el) => Iteratee s m a
length1 = liftI (step 0)
  where
    step !i (Chunk xs) = liftI (step $ i + fromIntegral (LL.length xs))
    step !i stream     = idone i stream
{-# INLINE length1 #-}
main = do
  i' <- enumFile 1024 "/usr/share/dict/words" (length1 :: (Monad m) => Iteratee ByteString m Int)
  result <- run i'
  print result
  {- Time measured on a linux x86 box: 
  $ time ./test ## above haskell compiled code
  4950996

  real    0m0.013s
  user    0m0.004s
  sys     0m0.007s

  $  time wc -c /usr/share/dict/words
  4950996 /usr/share/dict/words

  real    0m0.003s
  user    0m0.000s
  sys     0m0.002s
  -}

现在，如何扩展它来计算运行速度过快的行数？我做了一个版本，使用 Prelude.filter 仅过滤“\n”到长度，但它比 linux“wc -l”慢，因为内存太多，而且 gc（我猜是懒惰的评估）。所以，我使用 Data.ListLike.filter 编写了另一个版本，但它不会编译，因为它没有类型检查 - 在这里的帮助将不胜感激：

  {-# LANGUAGE BangPatterns #-}
  import Data.Iteratee as I
  import Data.ListLike as LL
  import Data.Iteratee.IO
  import Data.ByteString
  import Data.Char
  import Data.ByteString.Char8 (pack)

  numlines :: (Monad m, Num a, LL.ListLike s el) => Iteratee s m a
  numlines = liftI $ step 0
    where
      step !i (Chunk xs) = liftI (step $i + fromIntegral (LL.length $ LL.filter (\x ->  x == Data.ByteString.Char8.pack "\n")  xs))
      step !i stream = idone i stream
  {-# INLINE numlines #-}

  main = do
    i' <- enumFile 1024 "/usr/share/dict/words" (numlines :: (Monad m) => Iteratee ByteString m Int)
    result <- run i'
    print result

score 3 · Accepted Answer

所以我做了一些实验，我得到了一个 wc -l，它的速度只有“wc -l”的两倍。这甚至比上面显示的 wc -c 版本的性能还要好。

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Lazy.Char8 as BSL
import qualified Data.ByteString.Char8 as BS
import qualified Data.Enumerator as E
import qualified Data.Enumerator.Binary as EB
import Control.Monad.IO.Class (liftIO)
import Data.Int

numlines :: Int64 -> E.Iteratee BS.ByteString IO ()
numlines n = do 
    chunk <- EB.take 1024
    case chunk of 
        "" -> do liftIO $ print n
                 return ()
        a -> do let ct = BSL.count '\n' a
                numlines (n+ct)

main = do 
    let i = EB.enumFile "/usr/share/dict/words" E.$$ numlines 0
    E.run_ i

运行它与原生：

Eriks-MacBook-Air:skunk erikhinton$ time wc -l "/usr/share/dict/words"
  235886 /usr/share/dict/words

real    0m0.009s
user    0m0.006s
sys 0m0.002s
Eriks-MacBook-Air:skunk erikhinton$ time ./wcl
235886

real    0m0.019s
user    0m0.013s
sys 0m0.005s

[编辑]

这是一种更快、更小、更简洁/更具表现力的方式。这些枚举器开始变得有趣。

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Lazy.Char8 as BSL
import qualified Data.ByteString.Char8 as BS
import qualified Data.Enumerator as E
import qualified Data.Enumerator.Binary as EB
import qualified Data.Enumerator.List as EL
import Control.Monad.IO.Class (liftIO)
import Data.Int

numlines :: E.Iteratee BS.ByteString IO ()
numlines = do
           num <- EL.fold (\n b -> (BS.count '\n' b) + n ) 0
           liftIO . print $ num

main = do 
    let i = EB.enumFile "/usr/share/dict/words" E.$$ numlines
    E.run_ i

和时机

Eriks-MacBook-Air:skunk erikhinton$ time ./wcl2
235886

real    0m0.015s
user    0m0.010s
sys 0m0.004s

score 1 · Accepted Answer

如果您正在阅读ByteString块，则可以使用countfrom 函数Data.ByteString，相关步骤将是

step !i (Chunk xs) = liftI (step $ i + count 10 xs)

（也许使用 fromIntegral）。Data.ByteString.count相当快，不应该比 wc -l 慢太多。

score 1 · Accepted Answer

我想出了如何修复类型错误。修复类型错误的关键是了解Data.ListLike.filter和传递给该过滤器的ByteString输入之间的关系。这是 Data.ListLike.filter 的类型：

Data.ListLike.filter
:: Data.ListLike.Base.ListLike full item =>
 (item -> Bool) -> full -> full

如果我理解正确的话， full指的是枚举器/迭代器上下文中的流。item指的是流的元素。

现在，如果我们想过滤输入文件中的换行符，我们必须知道输入文件流的类型，以及该流中元素的类型。在这种情况下，输入文件被读取为ByteString流。ByteString被记录为 Word8 向量的节省空间的表示。所以，这里的项目类型是 Word8。

因此，当我们编写过滤器时，在 step 函数中，我们必须确保为 Word8 定义了 Bool 操作，因为这是传递给过滤器的项目的类型（如上所述）。我们正在过滤换行符。因此，像下面这样构建换行符的 Word8 表示并检查 Word8 类型的 x 是否相等的 bool 函数应该可以工作：

\x ->  x ==  Data.ByteString.Internal.c2w '\n'

还有一个缺失的部分 - 由于某些原因，编译器（v7.0.3 Mac）无法在 numfile 类型签名中推断出el的类型（如果有人知道为什么会这样，请讨论）。因此，明确告诉它是 Word8 可以解决编译问题：

numlines :: (Monad m, Num a, LL.ListLike s Word8) => Iteratee s m a

下面的完整代码 - 它编译并运行得非常快。

{-# LANGUAGE BangPatterns,FlexibleContexts #-}
import Data.Iteratee as I
import Data.ListLike as LL
import Data.Iteratee.IO
import Data.ByteString
import GHC.Word (Word8)
import Data.ByteString.Internal (c2w)

numlines :: (Monad m, Num a, LL.ListLike s Word8) => Iteratee s m a
numlines = liftI $ step 0
  where
    step !i (Chunk xs) = let newline = c2w '\n' in liftI (step $i + fromIntegral (LL.length $ LL.filter (\x ->  x == newline) xs))
    step !i stream = idone i stream
{-# INLINE numlines #-}


main = do
  i' <- enumFile 1024 "/usr/share/dict/words" (numlines :: (Monad m) => Iteratee ByteString m Int)
  result <- run i'
  print result
{- Time to run on mac OSX:

$ time ./test ## above compiled program: ghc --make -O2 test.hs
235886

real  0m0.011s
user  0m0.007s
sys 0m0.004s
$ time wc -l /usr/share/dict/words 
235886 /usr/share/dict/words

real  0m0.005s
user  0m0.002s
sys 0m0.002s
-}

score 1 · Accepted Answer

已经有很多好的答案了；在性能方面，我几乎没有什么可提供的，但有一些风格要点。

首先，我会这样写：

import Prelude as P
import Data.Iteratee
import qualified Data.Iteratee as I
import qualified Data.Iteratee.IO as I
import qualified Data.ByteString as B
import Data.Char
import System.Environment

-- numLines has a concrete stream type so it's not necessary to provide an
-- annotation later.  It could have a more general type.
numLines :: Monad m => I.Iteratee B.ByteString m Int
numLines = I.foldl' step 0
 where
  --step :: Int -> Word8 -> Int
  step acc el = if el == (fromIntegral $ ord '\n') then acc + 1 else acc

main = do
  f:_   <- getArgs
  words <- run =<< I.enumFile 65536 f numLines
  print words

最大的不同是这个使用Data.Iteratee.ListLike.foldl'. 请注意，只有单个流元素对阶跃函数很重要，而不是流类型。它与您在 eg 中使用的功能完全相同Data.ByteString.Lazy.foldl'。

使用foldl'也意味着您不需要手动编写迭代器liftI。除非绝对必要，否则我会劝阻用户不要这样做。结果通常更长，更难维持，几乎没有好处。

最后，我显着增加了缓冲区大小。在我的系统上，这比enumerator默认的 4096 略快，这又比您选择的 1024 略快（使用 iteratee）。当然，使用此设置的 YMMV。

haskell - 使用 Iteratee 库编写“wc -l” - 如何过滤换行符？

4 回答 4

Related

Reference