haskell - 使用 map 和 ByteString 键对折叠的性能分析

Question

我有一个小脚本可以从 apache 日志文件中读取、解析和导出某种有趣的（不是真的）统计信息。到目前为止，我已经做了两个简单的选择，日志文件中所有请求中发送的字节总数，以及最常见的 10 个 IP 地址。

第一个“模式”只是所有已解析字节的简单总和。第二个是地图上的折叠（Data.Map），insertWith (+) 1'用于计算出现次数。

第一个按照我的预期运行，大部分时间都花在解析上，在恒定空间中。

42,359,709,344 字节分配在堆中 72,405,840 字节在 GC 期间复制 113,712 字节最大驻留（1553 个样本） 145,872 字节最大斜率 2 MB 正在使用的总内存（0 MB 由于碎片而丢失）

第 0 代：76311 次收集，
0 次并行，0.89 秒，0.99 秒过去
第 1 代：1553 次收集，0 次并行，0.21 秒，0.22 秒过去

INIT 时间 0.00s（经过 0.00s） MUT 时间 21.76s（经过 24.82s） GC 时间 1.10s（经过 1.20s） EXIT 时间
0.00s（经过 0.00s）总时间 22.87s（经过 26.02s）

%GC 时间 4.8%（经过 4.6%）

分配速率 1,946,258,962 字节/MUT 秒

生产力占总用户的 95.2%，占总使用时间的 83.6%

但是，第二个没有！

49,398,834,152 字节分配在堆中 580,579,208 字节在 GC 期间复制 718,385,088 字节最大驻留（15 个样本） 134,532,128 字节最大斜率 1393 MB 正在使用的总内存（172 MB 由于碎片而丢失）

第 0 代：91275 次收集，
0 次并行，252.65 秒，254.46 秒经过
第 1 代：15 次收集，0 次并行，0.12 秒，0.12 秒经过

INIT 时间 0.00s（经过 0.00s） MUT 时间 41.11s（经过 48.87s） GC 时间 252.77s（经过 254.58s） EXIT 时间
0.00s（经过 0.01s）总时间 293.88s（经过 303.45s）

%GC 时间 86.0%（经过 83.9%）

分配速率 1,201,635,385 字节/MUT 秒

生产力占总用户的 14.0%，占总使用时间的 13.5%

这是代码。

{-# LANGUAGE OverloadedStrings #-}

module Main where

import qualified Data.Attoparsec.Lazy as AL
import Data.Attoparsec.Char8 hiding (space, take)
import qualified Data.ByteString.Char8 as S
import qualified Data.ByteString.Lazy.Char8 as L
import Control.Monad (liftM)
import System.Environment (getArgs)
import Prelude hiding (takeWhile)
import qualified Data.Map as M
import Data.List (foldl', sortBy)
import Text.Printf (printf)
import Data.Maybe (fromMaybe)

type Command = String

data LogLine = LogLine {
    getIP     :: S.ByteString,
    getIdent  :: S.ByteString,
    getUser   :: S.ByteString,
    getDate   :: S.ByteString,
    getReq    :: S.ByteString,
    getStatus :: S.ByteString,
    getBytes  :: S.ByteString,
    getPath   :: S.ByteString,
    getUA     :: S.ByteString
} deriving (Ord, Show, Eq)

quote, lbrack, rbrack, space :: Parser Char
quote  = satisfy (== '\"')
lbrack = satisfy (== '[')
rbrack = satisfy (== ']')
space  = satisfy (== ' ')

quotedVal :: Parser S.ByteString
quotedVal = do
    quote
    res <- takeTill (== '\"')
    quote
    return res

bracketedVal :: Parser S.ByteString
bracketedVal = do
    lbrack
    res <- takeTill (== ']')
    rbrack
    return res

val :: Parser S.ByteString
val = takeTill (== ' ')

line :: Parser LogLine
l    ine = do
    ip <- val
    space
    identity <- val
    space
    user <- val
    space
    date <- bracketedVal
    space
    req <- quotedVal
    space
    status <- val
    space
    bytes <- val
    (path,ua) <- option ("","") combined
    return $ LogLine ip identity user date req status bytes path ua

combined :: Parser (S.ByteString,S.ByteString)
combined = do
    space
    path <- quotedVal
    space
    ua <- quotedVal
    return (path,ua)

countBytes :: [L.ByteString] -> Int
countBytes = foldl' count 0
    where
        count acc l = case AL.maybeResult $ AL.parse line l of
            Just x  -> (acc +) . maybe 0 fst . S.readInt . getBytes $ x
            Nothing -> acc

countIPs :: [L.ByteString] -> M.Map S.ByteString Int
countIPs = foldl' count M.empty
    where
        count acc l = case AL.maybeResult $ AL.parse line l of
            Just x -> M.insertWith' (+) (getIP x) 1 acc
            Nothing -> acc

---------------------------------------------------------------------------------

main :: IO ()
main = do
  [cmd,path] <- getArgs
  dispatch cmd path

pretty :: Show a => Int -> (a, Int) -> String
pretty i (bs, n) = printf "%d: %s, %d" i (show bs) n

dispatch :: Command -> FilePath -> IO ()
dispatch cmd path = action path
    where
        action = fromMaybe err (lookup cmd actions)
        err    = printf "Error: %s is not a valid command." cmd

actions :: [(Command, FilePath -> IO ())]
actions = [("bytes", countTotalBytes)
          ,("ips",  topListIP)]

countTotalBytes :: FilePath -> IO ()
countTotalBytes path = print . countBytes . L.lines =<< L.readFile path

topListIP :: FilePath -> IO ()
topListIP path = do
    f <- liftM L.lines $ L.readFile path
    let mostPopular (_,a) (_,b) = compare b a
        m = countIPs f
    mapM_ putStrLn . zipWith pretty [1..] . take 10 . sortBy mostPopular . M.toList $ m

编辑：

添加 +RTS -A16M 将 GC 降低到 20%。内存使用当然不变。

score 3 · Accepted Answer

我建议对代码进行以下更改：

@@ -1,4 +1,4 @@
-{-# LANGUAGE OverloadedStrings #-}
+{-# LANGUAGE BangPatterns, OverloadedStrings #-}

 module Main where

@@ -9,7 +9,7 @@
 import Control.Monad (liftM)
 import System.Environment (getArgs)
 import Prelude hiding (takeWhile)
-import qualified Data.Map as M
+import qualified Data.HashMap.Strict as M
 import Data.List (foldl', sortBy)
 import Text.Printf (printf)
 import Data.Maybe (fromMaybe)
@@ -17,15 +17,15 @@
 type Command = String

 data LogLine = LogLine {
-    getIP     :: S.ByteString,
-    getIdent  :: S.ByteString,
-    getUser   :: S.ByteString,
-    getDate   :: S.ByteString,
-    getReq    :: S.ByteString,
-    getStatus :: S.ByteString,
-    getBytes  :: S.ByteString,
-    getPath   :: S.ByteString,
-    getUA     :: S.ByteString
+    getIP     :: !S.ByteString,
+    getIdent  :: !S.ByteString,
+    getUser   :: !S.ByteString,
+    getDate   :: !S.ByteString,
+    getReq    :: !S.ByteString,
+    getStatus :: !S.ByteString,
+    getBytes  :: !S.ByteString,
+    getPath   :: !S.ByteString,
+    getUA     :: !S.ByteString
 } deriving (Ord, Show, Eq)

 quote, lbrack, rbrack, space :: Parser Char
@@ -39,14 +39,14 @@
     quote
     res <- takeTill (== '\"')
     quote
-    return res
+    return $! res

 bracketedVal :: Parser S.ByteString
 bracketedVal = do
     lbrack
     res <- takeTill (== ']')
     rbrack
-    return res
+    return $! res

 val :: Parser S.ByteString
 val = takeTill (== ' ')
@@ -67,14 +67,14 @@
     space
     bytes <- val
     (path,ua) <- option ("","") combined
-    return $ LogLine ip identity user date req status bytes path ua
+    return $! LogLine ip identity user date req status bytes path ua

 combined :: Parser (S.ByteString,S.ByteString)
 combined = do
     space
-    path <- quotedVal
+    !path <- quotedVal
     space
-    ua <- quotedVal
+    !ua <- quotedVal
     return (path,ua)

 countBytes :: [L.ByteString] -> Int
@@ -84,11 +84,11 @@
             Just x  -> (acc +) . maybe 0 fst . S.readInt . getBytes $ x
             Nothing -> acc

-countIPs :: [L.ByteString] -> M.Map S.ByteString Int
+countIPs :: [L.ByteString] -> M.HashMap S.ByteString Int
 countIPs = foldl' count M.empty
     where
         count acc l = case AL.maybeResult $ AL.parse line l of
-            Just x -> M.insertWith' (+) (getIP x) 1 acc
+            Just x -> M.insertWith (+) (getIP x) 1 acc
             Nothing -> acc

 ---------------------------------------------------------------------------------

我制作了严格的字段LogLine以避免它们包含引用与解析相关的表达式的thunk。使字段严格是一个好习惯，除非你真的需要它们变得懒惰。

我确保尽快创建解析结果（这是$!更改的一部分），也是为了避免延迟解析，直到您实际检查LogLine.

HashMap最后，我从unordered-containers package切换到了更好的数据结构。请注意，其中的所有函数Data.HashMap.Strict都是值严格的，这意味着我们可以使用普通insertWith变体。

ByteString请注意，由于共享底层存储，获取 a 的子字符串会强制将原始字符串保留在内存中（这与 Java 相同String）。如果要确保不保留额外的内存，请使用包中的copy函数bytestring。您可以尝试调用copy结果，(getIP x)看看是否有任何不同。这里的权衡是使用一些额外的计算来复制字符串以换取更低的空间使用。

请注意，使用-A<high number>倾向于提高短期运行程序（即基准测试）的性能，但不一定适用于实际程序。也一样-H。至少更高的-H值（例如 1G）不会影响程序的性能。

score 0 · Accepted Answer

最明显的一点是，您的第一个脚本可以在看到数据后立即丢弃数据，而第二个脚本必须保留它看到的所有内容。因此，您希望第二个脚本至少占用 O(N) 内存，而第一个脚本可以在恒定空间中运行。

您是否尝试过在打开堆分析的情况下运行？我可以在您的代码中可能发生过多分配的地方进行一些尝试，但没有什么可以替代硬数据。

我自己会怀疑 Data.Map.insertWith' 调用，因为每个调用都会将现有 Map 盈余的一部分呈现给需求，并且需要复制和重新平衡，但这纯粹是我的猜测。如果 insertWith' 调用是罪魁祸首，那么由于您不需要插页式 Map 条目，因此一次构建整个映射可能会更快（没有任何增量来计算 IP），然后进行第二次传递来进行计数. 这样您就不会浪费时间重新平衡地图。您还可以利用您的关键数据类型适合 Int 的事实（好吧，如果它至少是 IPv4 地址则可以）并使用 Data.IntMap 代替，它的内存开销要低得多。

haskell - 使用 map 和 ByteString 键对折叠的性能分析

2 回答 2

Related

Reference