performance - 如何在 Haskell 中将大数据块解析到内存中？

Question

回想起来，整个问题可以归结为更简洁的东西。我正在寻找一个 Haskell 数据结构

看起来像一个列表
有 O(1) 查找
有 O(1) 元素替换或O(1) 元素追加（或前置......如果是这种情况，我可以反转我的索引查找）。我总是可以考虑其中一种或另一种来编写我以后的算法。
内存开销很小

我正在尝试构建图像文件解析器。文件格式是基本的 8 位彩色 ppm 文件，但我打算支持 16 位彩色文件以及 PNG 和 JPEG 文件。现有的 Netpbm 库，尽管有很多拆箱注释，但在尝试加载我使用的文件时实际上会消耗所有可用内存：

3-10张照片，最小的45MB，最大的110MB。

现在，我无法理解 Netpbm 代码中的优化，所以我决定自己尝试一下。这是一个简单的文件格式...

我首先决定无论文件格式是什么，我都将以这种格式存储未压缩的最终图像：

import Data.Vector.Unboxed (Vector)
data PixelMap = RGB8 {
      width :: Int
    , height :: Int
    , redChannel :: Vector Word8
    , greenChannel :: Vector Word8
    , blueChannel :: Vector Word8
    }

然后我写了一个解析器，它可以处理三个向量，如下所示：

import Data.Attoparsec.ByteString
data Progress = Progress {
      addr      :: Int
    , size      :: Int
    , redC      :: Vector Word8
    , greenC    :: Vector Word8
    , blueC     :: Vector Word8
    }

parseColorBinary :: Progress -> Parser Progress
parseColorBinary progress@Progress{..}
    | addr == size = return progress
    | addr < size = do
        !redV <- anyWord8
        !greenV <- anyWord8
        !blueV <- anyWord8
        parseColorBinary progress { addr    = addr + 1
                                  , redC    = redC V.// [(addr, redV)]
                                  , greenC  = greenC V.// [(addr, greenV)]
                                  , blueC   = blueC V.// [(addr, blueV)] }

在解析器结束时，我像这样构造 RGB8：

Progress{..} <- parseColorBinary $ ...
return $ RGB8 width height redC greenC blueC

像这样写的程序，加载这些 45MB 图像中的一个，将消耗 5GB 或更多的内存。如果我更改了Progressso that redC、greenC和blueCare all的定义!(Vector Word8)，那么程序将保持在合理的内存范围内，但加载单个文件需要很长时间，以至于我不允许它实际完成。最后，如果我用标准列表替换这里的向量，我的内存使用量会回升到每个文件 5GB（我假设......我实际上在点击之前用完了空间），加载时间大约为 6 秒. Ubuntu 的预览应用程序一旦启动，就会几乎立即加载和呈现文件。

根据每次对 V.// 的调用实际上每次都完全复制向量的理论，我尝试切换到Data.Vector.Unboxed.Mutable，但是......我什至无法进行类型检查。文档不存在，理解数据类型在做什么也需要与多个其他库进行斗争。而且我什至不知道它是否能解决问题，所以我什至都不愿意尝试。

基本问题实际上非常简单：

我如何在不使用大量内存的情况下快速读取、保留并可能操作非常大的数据结构？我发现的所有示例都是关于临时生成大量数据，然后尽快将其删除。

原则上，我希望最终表示是不可变的，但我不太关心是否必须使用可变结构才能到达那里。

为了完整起见，完整的代码（BSD3 许可）位于https://bitbucket.org/savannidgerinel/photo-tools的 bitbucket 上。该分支包含一个严格版本的解析器，performance可以通过快速更改Progress.Codec.Image.Netpbm

运行性能测试

ulimit -Sv 6000000 -- set a ulimit of 6GB, or change to whatever makes sense for you
cabal build
dist/build/perf-test/perf-test +RTS -p -sstderr

score 4 · Accepted Answer

我首先认为只需读取整个字节串，然后将内容解压缩到未装箱的向量中就足够了。实际上，即使没有神秘的空间泄漏，您发布的解析代码也会相当糟糕：您在输入的每个字节上复制所有三个向量的整体！谈论二次复杂度。

所以我写了以下内容：

chunksOf3 :: [a] -> [(a, a, a)]
chunksOf3 (a:b:c:xs) = (a, b, c) : chunksOf3 xs
chunksOf3 _          = []

parseRGB :: Int -> Atto.Parser (Vector Word8, Vector Word8, Vector Word8)
parseRGB size = do
    input <- Atto.take (size * 3)
    let (rs, gs, bs) = unzip3 $ chunksOf3 $ B.unpack input
    return (V.fromList rs, V.fromList gs, V.fromList bs)

然后用一个 45 Mb 的随机字节文件对其进行测试。我承认我很惊讶这段代码导致了千兆字节的 RAM 使用。我很好奇空间究竟在哪里泄漏。

不过，可变向量效果很好。以下代码使用 133 Mb RAM，Criterion 将其基准测试为包含 60 ms 的文件读取。我在评论中加入了一些解释。在 ST monad 和 SO 和其他地方的可变向量上也有充足的材料（我同意图书馆文档对初学者不友好）。

import Data.Vector.Unboxed (Vector)
import Data.ByteString (ByteString)

import qualified Data.Vector.Unboxed as V
import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed.Mutable as MV

import Control.Monad.ST.Strict 
import Data.Word
import Control.Monad
import Control.DeepSeq

-- benchmarking stuff
import Criterion.Main (defaultMainWith, bench, whnfIO)
import Criterion.Config (defaultConfig, Config(..), ljust)

-- This is just the part that parses the three vectors for the colors.
-- Of course, you can embed this into an Attoparsec computation by taking 
-- the current input, feeding it to parseRGB, or you can just take the right 
-- sized chunk in the parser and omit the "Maybe" test from the code below. 
parseRGB :: Int -> ByteString -> Maybe (Vector Word8, Vector Word8, Vector Word8)
parseRGB size input 
    | 3* size > B.length input = Nothing
    | otherwise = Just $ runST $ do

        -- We are allocating three mutable vectors of size "size"
        -- This is usually a bit of pain for new users, because we have to
        -- specify the correct type somewhere, and it's not an exactly simple type.
        -- In the ST monad there is always an "s" type parameter that labels the
        -- state of the action. A type of "ST s something" is a bit similar to
        -- "IO something", except that the inner type often also contains "s" as
        -- parameter. The purpose of that "s" is to statically disallow mutable
        -- variables from escaping the ST action. 
        [r, g, b] <- replicateM 3 $ MV.new size :: ST s [MV.MVector s Word8]

        -- forM_ = flip mapM_
        -- In ST code forM_ is a nicer looking approximation of the usual
        -- imperative loop. 
        forM_ [0..size - 1] $ \i -> do
            let i' = 3 * i
            MV.unsafeWrite r i (B.index input $ i'    )
            MV.unsafeWrite g i (B.index input $ i' + 1)
            MV.unsafeWrite b i (B.index input $ i' + 2)

        -- freeze converts a mutable vector living in the ST monad into 
        -- a regular vector, which can be then returned from the action
        -- since its type no longer depends on that pesky "s".
        -- unsafeFreeze does the conversion in place without copying.
        -- This implies that the original mutable vector should not be used after
        -- unsafeFreezing. 
        [r, g, b] <- mapM V.unsafeFreeze [r, g, b]
        return (r, g, b)

-- I prepared a file with 3 * 15 million random bytes.
inputSize = 15000000
benchConf = defaultConfig {cfgSamples = ljust 10}

main = do
    defaultMainWith benchConf (return ()) $ [
        bench "parseRGB test" $ whnfIO $ do 
            input <- B.readFile "randomInp.dat" 
            force (parseRGB inputSize input) `seq` putStrLn "done"
        ]

score 3 · Accepted Answer

这是一个直接从磁盘解析文件而不将任何中间文件加载到内存中的版本：

import Control.Applicative
import Control.Monad (void)
import Data.Attoparsec.ByteString (anyWord8)
import Data.Attoparsec.ByteString.Char8 (decimal)
import qualified Data.Attoparsec.ByteString as Attoparsec
import Data.ByteString (ByteString)
import Data.Vector.Unboxed (Vector)
import Data.Word (Word8)
import Control.Foldl (FoldM(..), impurely, vector, premapM) -- Uses `foldl-1.0.3`
import qualified Pipes.ByteString
import Pipes.Parse
import Pipes.Attoparsec (parse, parsed)
import qualified System.IO as IO

data PixelMap = PixelMap {
      width :: Int
    , height :: Int
    , redChannel :: Vector Word8
    , greenChannel :: Vector Word8
    , blueChannel :: Vector Word8
    } deriving (Show)

-- Fold three vectors simultaneously, ensuring strictness and efficiency
rgbVectors
    :: FoldM IO (Word8, Word8, Word8) (Vector Word8, Vector Word8, Vector Word8)
rgbVectors =
    (,,) <$> premapM _1 vector <*> premapM _2 vector <*> premapM _3 vector
  where
    _1 (a, b, c) = a
    _2 (a, b, c) = b
    _3 (a, b, c) = c

triples
    :: Monad m
    => Producer ByteString m r
    -> Producer (Word8, Word8, Word8) m ()
triples p = void $ parsed ((,,) <$> anyWord8 <*> anyWord8 <*> anyWord8) p

-- I will probably ask Renzo to simplify the error handling for `parse`
-- This is a helper function to just return `Nothing`
parse'
    :: Monad m
    => Attoparsec.Parser r -> Parser ByteString m (Maybe r)
parse' parser = do
    x <- parse parser
    return $ case x of
        Just (Right r) -> Just r
        _              -> Nothing

parsePixelMap :: Producer ByteString IO r -> IO (Maybe PixelMap)
parsePixelMap p = do
    let parseWH = do
            mw <- parse' decimal
            mh <- parse' decimal
            return ((,) <$> mw <*> mh)
    (x, p') <- runStateT parseWH p
    case x of
        Nothing     -> return Nothing
        Just (w, h) -> do
            let size = w * h
                parser = impurely foldAllM rgbVectors
                source = triples (p' >-> Pipes.ByteString.take size)
            (rs, gs, bs) <- evalStateT parser source
            return $ Just (PixelMap w h rs gs bs)

main = IO.withFile "image.ppm" IO.ReadMode $ \handle -> do
    pixelMap <- parsePixelMap (Pipes.ByteString.fromHandle handle)
    print pixelMap

我在一个 50 MB 的文件上测试了它，没有标题逻辑，它在大致相同的空间中运行。

performance - 如何在 Haskell 中将大数据块解析到内存中？

2 回答 2

Related

Reference