performance - 分析 Haskell 程序

Question

我有一段代码使用sequence. 从道德上讲，它做了这样的事情：

sampleMean :: MonadRandom m => Int -> m Float -> m Float
sampleMean n dist = do
  xs <- sequence (replicate n dist)
  return (sum xs)

除了它有点复杂。我感兴趣的实际代码是likelihoodWeighting这个Github repo中的函数。

我注意到运行时间与n. 特别是一旦n超过某个值，就会达到内存限制，运行时间就会爆炸。我不确定，但我认为这是因为sequence正在建立一个长长的 thunk 列表，直到调用sum.

一旦我超过了大约 100,000 个样本，程序就会慢下来。我想对此进行优化（我的感觉是 1000 万个样本不应该成为问题）所以我决定对其进行分析 - 但我在理解分析器的输出时遇到了一些麻烦。

剖析

我在一个文件中创建了一个简短的可执行文件，该文件main.hs使用 100,000 个样本运行我的函数。这是做的输出

$ ghc -O2 -rtsopts main.hs
$ ./main +RTS -s

我注意到的第一件事 - 它分配了近 1.5 GB 的堆，并将 60% 的时间用于垃圾收集。这通常表明过于懒惰吗？

 1,377,538,232 bytes allocated in the heap
 1,195,050,032 bytes copied during GC
   169,411,368 bytes maximum residency (12 sample(s))
     7,360,232 bytes maximum slop
           423 MB total memory in use (0 MB lost due to fragmentation)

Generation 0:  2574 collections,     0 parallel,  2.40s,  2.43s elapsed
Generation 1:    12 collections,     0 parallel,  1.07s,  1.28s elapsed

INIT  time    0.00s  (  0.00s elapsed)
MUT   time    1.92s  (  1.94s elapsed)
GC    time    3.47s  (  3.70s elapsed)
RP    time    0.00s  (  0.00s elapsed)
PROF  time    0.23s  (  0.23s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time    5.63s  (  5.87s elapsed)

%GC time      61.8%  (63.1% elapsed)

Alloc rate    716,368,278 bytes per MUT second

Productivity  34.2% of total user, 32.7% of total elapsed

以下是来自的结果

$ ./main +RTS -p

第一次运行时，发现有一个函数被重复调用，结果我可以记住它，这使速度提高了 2 倍。然而，它并没有解决空间泄漏问题。

COST CENTRE           MODULE                no. entries  %time %alloc   %time %alloc

MAIN                  MAIN                    1        0   0.0    0.0   100.0  100.0
 main                 Main                  434        4   0.0    0.0   100.0  100.0
  likelihoodWeighting AI.Probability.Bayes  445        1   0.0    0.3   100.0  100.0
   distributionLW     AI.Probability.Bayes  448        1   0.0    2.6     0.0    2.6
   getSampleLW        AI.Probability.Bayes  446   100000  20.0   50.4   100.0   97.1
    bnProb            AI.Probability.Bayes  458   400000   0.0    0.0     0.0    0.0
    bnCond            AI.Probability.Bayes  457   400000   6.7    0.8     6.7    0.8
    bnVals            AI.Probability.Bayes  455   400000  20.0    6.3    26.7    7.1
     bnParents        AI.Probability.Bayes  456   400000   6.7    0.8     6.7    0.8
    bnSubRef          AI.Probability.Bayes  454   800000  13.3   13.5    13.3   13.5
    weightedSample    AI.Probability.Bayes  447   100000  26.7   23.9    33.3   25.3
     bnProb           AI.Probability.Bayes  453   100000   0.0    0.0     0.0    0.0
     bnCond           AI.Probability.Bayes  452   100000   0.0    0.2     0.0    0.2
     bnVals           AI.Probability.Bayes  450   100000   0.0    0.3     6.7    0.5
      bnParents       AI.Probability.Bayes  451   100000   6.7    0.2     6.7    0.2
     bnSubRef         AI.Probability.Bayes  449   200000   0.0    0.7     0.0    0.7

这是一个堆配置文件。我不知道为什么它声称运行时间是 1.8 秒——这次运行大约需要 6 秒。

在此处输入图像描述

谁能帮我解释分析器的输出 - 即确定瓶颈在哪里，并就如何加快速度提供建议？

score 14 · Accepted Answer

通过结合JohnL 的使用foldMin的建议，已经实现了巨大的改进likelihoodWeighting。这将这里的内存使用量减少了大约十倍，并将 GC 时间显着降低到几乎或实际上可以忽略不计。

使用当前源运行分析会产生

probabilityIO              AI.Util.Util          26.1   42.4    413 290400000
weightedSample.go          AI.Probability.Bayes  16.1   19.1    255 131200080
bnParents                  AI.Probability.Bayes  10.8    1.2    171   8000384
bnVals                     AI.Probability.Bayes  10.4    7.8    164  53603072
bnCond                     AI.Probability.Bayes   7.9    1.2    125   8000384
ndSubRef                   AI.Util.Array          4.8    9.2     76  63204112
bnSubRef                   AI.Probability.Bayes   4.7    8.1     75  55203072
likelihoodWeighting.func   AI.Probability.Bayes   3.3    2.8     53  19195128
%!                         AI.Util.Util           3.3    0.5     53   3200000
bnProb                     AI.Probability.Bayes   2.5    0.0     40        16
bnProb.p                   AI.Probability.Bayes   2.5    3.5     40  24001152
likelihoodWeighting        AI.Probability.Bayes   2.5    2.9     39  20000264
likelihoodWeighting.func.x AI.Probability.Bayes   2.3    0.2     37   1600000

和报告的 13MB 内存使用量-s，~5MB 最大驻留。这已经不算太糟糕了。

尽管如此，我们仍有一些可以改进的地方。首先，在大计划中，一件相对较小的事情AI.UTIl.Array.ndSubRef：

ndSubRef :: [Int] -> Int
ndSubRef ns = sum $ zipWith (*) (reverse ns) (map (2^) [0..])

反转列表并映射(2^)到另一个列表效率低下，更好的是

ndSubRef = L.foldl' (\a d -> 2*a + d) 0

它不需要将整个列表保留在内存中（可能不是什么大问题，因为列表会很短），因为它可以反转它，并且不需要分配第二个列表。分配的减少是显着的，大约 10%，而且这部分运行得更快，

ndSubRef                   AI.Util.Array          1.7    1.3     24   8000384

在修改运行的profile中，但由于只占用了整体时间的一小部分，所以整体影响很小。有潜在的更大的鱼可以weightedSample煎炸likelihoodWeighting。

让我们添加一些严格性，weightedSample看看它是如何改变事物的：

weightedSample :: Ord e => BayesNet e -> [(e,Bool)] -> IO (Map e Bool, Prob)
weightedSample bn fixed =
    go 1.0 (M.fromList fixed) (bnVars bn)
    where
        go w assignment []     = return (assignment, w)
        go w assignment (v:vs) = if v `elem` vars
            then
                let w' = w * bnProb bn assignment (v, fixed %! v)
                in go w' assignment vs
            else do
                let p = bnProb bn assignment (v,True)
                x <- probabilityIO p
                go w (M.insert v x assignment) vs

        vars = map fst fixed

的 weight 参数go从不强制，assignment 参数也不是，因此它们可以建立 thunk。让我们启用{-# LANGUAGE BangPatterns #-}并强制更新立即生效，并p在将其传递给之前进行评估probabilityIO：

go w assignment (v:vs) = if v `elem` vars
    then
        let !w' = w * bnProb bn assignment (v, fixed %! v)
        in go w' assignment vs
    else do
        let !p = bnProb bn assignment (v,True)
        x <- probabilityIO p
        let !assignment' = M.insert v x assignment
        go w assignment' vs

这带来了分配的进一步减少 (~9%) 和小幅加速 (~%13%)，但总内存使用量和最大驻留没有太大变化。

我没有看到其他明显的变化，所以让我们看看likelihoodWeighting：

func m _ = do
    (a, w) <- weightedSample bn fixed
    let x = a ! e
    return $! x `seq` w `seq` M.adjust (+w) x m

在最后一行中，first，w现在已经被评估了weightedSample，所以我们在这里不需要seq它，keyx是评估更新后的地图所必需的，所以seq这也不是必需的。那条线上的坏事是M.adjust。adjust无法强制更新函数的结果，因此会在地图的值中构建 thunk。您可以通过查找修改后的值并强制执行对 thunk 的评估，但Data.Map这里提供了一种更方便的方法，因为保证存在更新地图的键，insertWith'：

func !m _ = do
    (a, w) <- weightedSample bn fixed
    let x = a ! e
    return (M.insertWith' (+) x w m)

m（注意：GHC 使用 bang-pattern比使用return $! ...这里优化得更好）。这略微减少了总分配并且不会显着改变运行时间，但对使用的总内存和最大驻留有很大影响：

 934,566,488 bytes allocated in the heap
   1,441,744 bytes copied during GC
      68,112 bytes maximum residency (1 sample(s))
      23,272 bytes maximum slop
           1 MB total memory in use (0 MB lost due to fragmentation)

运行时间的最大改进是避免randomIO使用StdGen非常慢。

我很惊讶这些功能需要多少时间bn*，但没有看到任何明显的低效率。

score 7 · Accepted Answer

我很难消化这些配置文件，但我之前被踢过屁股，因为MonadRandomon Hackage 是严格的。创建一个懒惰的版本MonadRandom让我的记忆问题消失了。

我的同事还没有获得发布代码的许可，但我已经Control.Monad.LazyRandom在pastebin上发布了。或者，如果您想查看一些解释完全惰性随机搜索的摘录，包括无限的随机计算列表，请查看体验报告：计算生物学中的 Haskell。

score 4 · Accepted Answer

我整理了一个非常基本的示例，发布在此处：http ://hpaste.org/71919 。我不确定它是否像您的示例一样......只是一个似乎有效的非常小的东西。

编译-prof和-fprof-auto运行 100000 次迭代会产生以下分析输出的头部（请原谅我的行号）：

  8 COST CENTRE                   MODULE               %time %alloc
  9 
 10 sample                        AI.Util.ProbDist      31.5   36.6
 11 bnParents                     AI.Probability.Bayes  23.2    0.0
 12 bnRank                        AI.Probability.Bayes  10.7   23.7
 13 weightedSample.go             AI.Probability.Bayes   9.6   13.4
 14 bnVars                        AI.Probability.Bayes   8.6   16.2
 15 likelihoodWeighting           AI.Probability.Bayes   3.8    4.2
 16 likelihoodWeighting.getSample AI.Probability.Bayes   2.1    0.7
 17 sample.cumulative             AI.Util.ProbDist       1.7    2.1
 18 bnCond                        AI.Probability.Bayes   1.6    0.0
 19 bnRank.ps                     AI.Probability.Bayes   1.1    0.0

以下是汇总统计数据：

1,433,944,752 bytes allocated in the heap
 1,016,435,800 bytes copied during GC
   176,719,648 bytes maximum residency (11 sample(s))
     1,900,232 bytes maximum slop
           400 MB total memory in use (0 MB lost due to fragmentation)

INIT    time    0.00s  (  0.00s elapsed)
MUT     time    1.40s  (  1.41s elapsed)
GC      time    1.08s  (  1.24s elapsed)
Total   time    2.47s  (  2.65s elapsed)

%GC     time      43.6%  (46.8% elapsed)

Alloc rate    1,026,674,336 bytes per MUT second

Productivity  56.4% of total user, 52.6% of total elapsed

请注意，分析器将手指指向sample。我通过使用强制输入return该函数$!，然后是一些汇总统计信息：

1,776,908,816 bytes allocated in the heap
  165,232,656 bytes copied during GC
   34,963,136 bytes maximum residency (7 sample(s))
      483,192 bytes maximum slop
           68 MB total memory in use (0 MB lost due to fragmentation)

INIT    time    0.00s  (  0.00s elapsed)
MUT     time    2.42s  (  2.44s elapsed)
GC      time    0.21s  (  0.23s elapsed)
Total   time    2.63s  (  2.68s elapsed)

%GC     time       7.9%  (8.8% elapsed)

Alloc rate    733,248,745 bytes per MUT second

Productivity  92.1% of total user, 90.4% of total elapsed

在 GC 方面效率更高，但当时并没有太大变化。您也许可以继续以这种配置文件/调整方式进行迭代，以针对您的瓶颈并获得更好的性能。

score 4 · Accepted Answer

我认为您的初步诊断是正确的，而且我从未见过在记忆效应开始后有用的分析报告。

问题是您要遍历列表两次，一次sequence又一次 for sum。在 Haskell 中，大型列表的多个列表遍历对性能非常非常不利。解决方案通常是使用某种类型的折叠，例如foldM. 你的sampleMean函数可以写成

{-# LANGUAGE BangPatterns #-}

sampleMean2 :: MonadRandom m => Int -> m Float -> m Float
sampleMean2 n dist = foldM (\(!a) mb -> liftM (+a) mb) 0 $ replicate n dist

例如，只遍历列表一次。

你也可以做同样的事情likelihoodWeighting。为了防止重击，确保折叠函数中的累加器具有适当的严格性很重要。

performance - 分析 Haskell 程序

4 回答 4

Related

Reference