5

我正在尝试使用Haskell查找文件中字符的频率。我希望能够处理 ~500MB 大小的文件。

我到目前为止所尝试的

  1. 它完成了这项工作,但有点慢,因为它解析文件 256 次

    calculateFrequency :: L.ByteString -> [(Word8, Int64)]
    calculateFrequency f = foldl (\acc x -> (x, L.count x f):acc) [] [255, 254.. 0]
    
  2. 我也尝试过使用 Data.Map 但程序内存不足(在 ghc 解释器中)。

    import qualified Data.ByteString.Lazy as L
    import qualified Data.Map as M
    
    calculateFrequency' :: L.ByteString -> [(Word8, Int64)]
    calculateFrequency' xs = M.toList $ L.foldl' (\m word -> M.insertWith (+) word 1 m) (M.empty) xs
    
4

4 回答 4

14

这是一个使用可变的、未装箱的向量而不是更高级别的构造的实现。它还conduit用于读取文件以避免惰性 I/O。

import           Control.Monad.IO.Class
import qualified Data.ByteString             as S
import           Data.Conduit
import           Data.Conduit.Binary         as CB
import qualified Data.Conduit.List           as CL
import qualified Data.Vector.Unboxed.Mutable as VM
import           Data.Word                   (Word8)

type Freq = VM.IOVector Int

newFreq :: MonadIO m => m Freq
newFreq = liftIO $ VM.replicate 256 0

printFreq :: MonadIO m => Freq -> m ()
printFreq freq =
    liftIO $ mapM_ go [0..255]
  where
    go i = do
        x <- VM.read freq i
        putStrLn $ show i ++ ": " ++ show x

addFreqWord8 :: MonadIO m => Freq -> Word8 -> m ()
addFreqWord8 f w = liftIO $ do
    let index = fromIntegral w
    oldCount <- VM.read f index
    VM.write f index (oldCount + 1)

addFreqBS :: MonadIO m => Freq -> S.ByteString -> m ()
addFreqBS f bs =
    loop (S.length bs - 1)
  where
    loop (-1) = return ()
    loop i = do
        addFreqWord8 f (S.index bs i)
        loop (i - 1)

-- | The main entry point.
main :: IO ()
main = do
    freq <- newFreq
    runResourceT
        $  sourceFile "random"
        $$ CL.mapM_ (addFreqBS freq)
    printFreq freq

我在 500MB 的随机数据上运行了这个,并与@josejuan 的基于 UArray 的答案进行了比较:

  • 基于管道/可变向量:1.006s
  • UArray:17.962s

我认为应该可以保留 josejuan 的高级方法的大部分优雅,同时保持可变向量实现的速度,但我还没有机会尝试实现类似的东西。此外,请注意,使用一些通用帮助函数(如 Data.ByteString.mapM 或 Data.Conduit.Binary.mapM),实现可能会更加简单,而不会影响性能。

您也可以在 FP Haskell Center 上使用此实现

编辑:我添加了其中一个缺少的功能conduit并稍微清理了代码;它现在如下所示:

import           Control.Monad.Trans.Class   (lift)
import           Data.ByteString             (ByteString)
import           Data.Conduit                (Consumer, ($$))
import qualified Data.Conduit.Binary         as CB
import qualified Data.Vector.Unboxed         as V
import qualified Data.Vector.Unboxed.Mutable as VM
import           System.IO                   (stdin)

freqSink :: Consumer ByteString IO (V.Vector Int)
freqSink = do
    freq <- lift $ VM.replicate 256 0
    CB.mapM_ $ \w -> do
        let index = fromIntegral w
        oldCount <- VM.read freq index
        VM.write freq index (oldCount + 1)
    lift $ V.freeze freq

main :: IO ()
main = (CB.sourceHandle stdin $$ freqSink) >>= print

功能上的唯一区别是频率的打印方式。

于 2014-01-15T10:24:43.087 回答
7

@Alex 的回答很好,但是只有 256 个值(索引)的数组应该更好

import qualified Data.ByteString.Lazy as L
import qualified Data.Array.Unboxed as A
import qualified Data.ByteString as B
import Data.Int
import Data.Word

fq :: L.ByteString -> A.UArray Word8 Int64
fq = A.accumArray (+) 0 (0, 255) . map (\c -> (c, 1)) . concat . map B.unpack . L.toChunks

main = L.getContents >>= print . fq

@alex 代码占用(对于我的示例文件)24.81 段,使用数组占用 7.77 段。

更新:

虽然 Snoyman 解决方案更好,但改进unpack可能避免

fq :: L.ByteString -> A.UArray Word8 Int64
fq = A.accumArray (+) 0 (0, 255) . toCounterC . L.toChunks
     where toCounterC [] = []
           toCounterC (x:xs) = toCounter x (B.length x) xs
           toCounter  _ 0 xs = toCounterC xs
           toCounter  x i xs = (B.index x i', 1): toCounter x i' xs
                               where i' = i - 1

约 50% 的加速。

更新:

使用IOVector作为 Snoyman 是作为Conduit版本(确实快一点,但这是一个原始代码,更好地使用Conduit

import           Data.Int
import           Data.Word
import           Control.Monad.IO.Class
import qualified Data.ByteString.Lazy          as L
import qualified Data.Array.Unboxed            as A
import qualified Data.ByteString               as B
import qualified Data.Vector.Unboxed.Mutable   as V

fq :: L.ByteString -> IO (V.IOVector Int64)
fq xs =
     do
       v <- V.replicate 256 0 :: IO (V.IOVector Int64)
       g v $ L.toChunks xs
       return v
     where g v = toCounterC
                 where toCounterC [] = return ()
                       toCounterC (x:xs) = toCounter x (B.length x) xs
                       toCounter  _ 0 xs = toCounterC xs
                       toCounter  x i xs = do
                                             let i' = i - 1
                                                 w  = fromIntegral $ B.index x i'
                                             c <- V.read v w
                                             V.write v w (c + 1)
                                             toCounter x i' xs

main = do
          v <- L.getContents >>= fq
          mapM_ (\i -> V.read v i >>= liftIO . putStr . (++", ") . show) [0..255]
于 2014-01-15T09:22:57.557 回答
4

这适用于我的电脑:

module Main where
import qualified Data.HashMap.Strict as M
import qualified Data.ByteString.Lazy as L
import Data.Word
import Data.Int

calculateFrequency :: L.ByteString -> [(Word8, Int64)]
calculateFrequency xs = M.toList $ L.foldl' (\m word -> M.insertWith (+) word 1 m) M.empty xs

main = do
    bs <- L.readFile "E:\\Steam\\SteamApps\\common\\Sid Meier's Civilization V\\Assets\\DLC\\DLC_Deluxe\\Behind the Scenes\\Behind the Scenes.wmv"
    print (calculateFrequency bs)

不会耗尽内存,甚至不会加载整个文件,但在 600mb+ 文件上需要永远(大约一分钟)!我使用 ghc 7.6.3 编译了这个。

我应该指出,除了 strictHashMap而不是 lazy之外,代码基本上是相同的Map

请注意,这比在这种情况下insertWith快两倍。在我的机器上,编写的代码在 54 秒内执行,而使用的版本需要 107 秒。HashMapMapMap

于 2014-01-15T08:36:11.067 回答
0

我的两分钱(使用 STUArray)。无法将其与此处的其他解决方案进行比较。可能有人愿意尝试...

module Main where

import Data.Array.ST (runSTUArray, newArray, readArray, writeArray)
import Data.Array.Unboxed (UArray)
import qualified Data.ByteString.Lazy as L (ByteString, unpack, getContents)
import Data.Word
import Data.Int
import Control.Monad (forM_)

calculateFrequency :: L.ByteString -> UArray Word8 Int64 
calculateFrequency bs = runSTUArray $ do
    a <- newArray (0, 255) 0
    forM_ (L.unpack bs) $ \i -> readArray a i >>= writeArray a i . succ
    return a

main = L.getContents >>= print . calculateFrequency
于 2014-01-15T19:08:26.320 回答