performance - 调试最长公共子序列算法的性能瓶颈

Question

我正在使用向量库和状态单子在 Haskell 中编写最长的公共子序列算法（以封装Miller O(NP) 算法的非常命令性和可变性）。我已经为某个我需要它的项目用 C 语言编写了它，现在我正在用 Haskell 编写它，以探索如何编写那些与 C 相匹配的具有良好性能的命令式网格行走算法。我用未装箱向量编写的版本对于相同的输入，大约比 C 版本慢 4 倍（并使用正确的优化标志编译 - 我使用了系统时钟时间和Criterion验证 Haskell 和 C 版本之间的相对时间测量的方法，以及相同的数据类型，包括大输入和小输入）。我一直在试图找出性能问题可能出在哪里，并将感谢反馈 - 我可能在这里遇到了一些众所周知的性能问题，尤其是在我在这里大量使用的向量库中。

在我的代码中，我有一个最常调用的名为 gridWalk 的函数，它也完成了大部分工作。性能下降可能存在，但我无法弄清楚它可能是什么。完整的 Haskell 代码在这里。以下代码片段：

import Data.Vector.Unboxed.Mutable as MU
import Data.Vector.Unboxed as U hiding (mapM_)
import Control.Monad.ST as ST
import Control.Monad.Primitive (PrimState)
import Control.Monad (when) 
import Data.STRef (newSTRef, modifySTRef, readSTRef)
import Data.Int


type MVI1 s  = MVector (PrimState (ST s)) Int

cmp :: U.Vector Int32 -> U.Vector Int32 -> Int -> Int -> Int
cmp a b i j = go 0 i j
               where
                 n = U.length a
                 m = U.length b
                 go !len !i !j| (i<n) && (j<m) && ((unsafeIndex a i) == (unsafeIndex b j)) = go (len+1) (i+1) (j+1)
                                    | otherwise = len

-- function to find previous y on diagonal k for furthest point 
findYP :: MVI1 s -> Int -> Int -> ST s (Int,Int)
findYP fp k offset = do
              let k0 = k+offset-1
                  k1 = k+offset+1
              y0 <- MU.unsafeRead fp k0 >>= \x -> return $ 1+x
              y1 <- MU.unsafeRead fp k1
              if y0 > y1 then return (k0,y0)
              else return (k1,y1)
{-#INLINE findYP #-}

gridWalk :: Vector Int32 -> Vector Int32 -> MVI1 s -> Int -> (Vector Int32 -> Vector Int32 -> Int -> Int -> Int) -> ST s ()
gridWalk a b fp !k cmp = {-#SCC gridWalk #-} do
   let !offset = 1+U.length a
   (!kp,!yp) <- {-#SCC findYP #-} findYP fp k offset                          
   let xp = yp-k
       len = {-#SCC cmp #-} cmp a b xp yp
       x = xp+len
       y = yp+len

   {-#SCC "updateFP" #-} MU.unsafeWrite fp (k+offset) y  
   return ()
{-#INLINE gridWalk #-}

-- The function below executes ct times, and updates furthest point as they are found during furthest point search
findSnakes :: Vector Int32 -> Vector Int32 -> MVI1 s ->  Int -> Int -> (Vector Int32 -> Vector Int32 -> Int -> Int -> Int) -> (Int -> Int -> Int) -> ST s ()
findSnakes a b fp !k !ct cmp op = {-#SCC findSnakes #-} U.forM_ (U.fromList [0..ct-1]) (\x -> gridWalk a b fp (op k x) cmp)
{-#INLINE findSnakes #-}

我添加了一些成本中心注释，并使用特定 LCS 输入运行分析以进行测试。这是我得到的：

  total time  =        2.39 secs   (2394 ticks @ 1000 us, 1 processor)
  total alloc = 4,612,756,880 bytes  (excludes profiling overheads)

COST CENTRE MODULE    %time %alloc

gridWalk    Main       67.5   52.7
findSnakes  Main       23.2   27.8
cmp         Main        4.2    0.0
findYP      Main        3.5   19.4
updateFP    Main        1.6    0.0


                                                         individual     inherited
COST CENTRE    MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN           MAIN                     64           0    0.0    0.0   100.0  100.0
 main          Main                    129           0    0.0    0.0     0.0    0.0
 CAF           Main                    127           0    0.0    0.0   100.0  100.0
  findSnakes   Main                    141           0    0.0    0.0     0.0    0.0
  main         Main                    128           1    0.0    0.0   100.0  100.0
   findSnakes  Main                    138           0    0.0    0.0     0.0    0.0
    gridWalk   Main                    139           0    0.0    0.0     0.0    0.0
     cmp       Main                    140           0    0.0    0.0     0.0    0.0
   while       Main                    132        4001    0.1    0.0   100.0  100.0
    findSnakes Main                    133       12000   23.2   27.8    99.9   99.9
     gridWalk  Main                    134    16004000   67.5   52.7    76.7   72.2
      cmp      Main                    137    16004000    4.2    0.0     4.2    0.0
      updateFP Main                    136    16004000    1.6    0.0     1.6    0.0
      findYP   Main                    135    16004000    3.5   19.4     3.5   19.4
   newVI1      Main                    130           1    0.0    0.0     0.0    0.0
   newVI1.\   Main                    131        8004    0.0    0.0     0.0    0.0
 CAF           GHC.Conc.Signal         112           0    0.0    0.0     0.0    0.0
 CAF           GHC.IO.Encoding         104           0    0.0    0.0     0.0    0.0
 CAF           GHC.IO.Encoding.Iconv   102           0    0.0    0.0     0.0    0.0
 CAF           GHC.IO.Handle.FD         95           0    0.0    0.0     0.0    0.0

如果我正确解释分析输出（并假设分析没有太多失真），gridWalk则需要大部分时间，但主要功能cmp和findYP执行繁重工作的主要功能gridWalk似乎在分析报告中花费的时间很少。那么，也许瓶颈在于函数用来调用的forM_包装器？堆配置文件看起来也很干净：findSnakesgridWalk

阅读核心，什么都没有真正跳出来。我认为内部循环中的某些值可能会被装箱，但我没有在核心中发现它们。我希望性能问题是由于我错过了一些简单的事情。

更新

根据@DanielFischer 的建议，我将forM_of替换Data.Vector.Unboxed为Control.MonadinfindSnakes函数，将 C 版本的性能从 4 倍提高到 2.5 倍。如果您想尝试一下，Haskell 和 C 版本现在发布在这里。

我仍在挖掘核心，看看瓶颈在哪里。gridWalk是最常被调用的函数，为了使其表现良好，lcsh应该将whileM_循环减少为条件检查和内联findSnakes代码的良好迭代内部循环。我怀疑在汇编中，whileM_循环不是这种情况，但是由于我对翻译核心以及在汇编中定位名称损坏的 GHC 函数不是很了解，我想这只是耐心地解决问题直到我想通了。同时，如果有任何关于性能修复的指示，我们将不胜感激。

我能想到的另一种可能性是函数调用期间堆检查的开销。正如分析报告中所见，gridWalk被调用了 16004000 次。假设堆检查需要 6 个周期（我猜它会更少，但我们仍然假设），在 3.33GHz 的机器上进行 96024000 个周期大约是 0.02 秒。

此外，一些性能数据：

Haskell code (GHC 7.6.1 x86_64)forM_：修复前大约是 0.25 秒。

 time ./T
1

real    0m0.150s
user    0m0.145s
sys     0m0.003s

C code (gcc 4.7.2 x86_64)：

time ./test
1

real    0m0.065s
user    0m0.063s
sys     0m0.000s

更新 2：

更新的代码在这里。使用STUArray也不会改变数字。性能大约是 1.5 倍Mac OS X (x86_64,ghc7.6.1)，与@DanielFischer 在 Linux 上的报告非常相似。

哈斯克尔代码：

$ time ./Diff
1

real    0m0.087s
user    0m0.084s
sys 0m0.003s

C代码：

$ time ./test
1

real    0m0.056s
user    0m0.053s
sys 0m0.002s

一看cmm，调用是尾递归的，被变成循环llvm。但是每次新的迭代似乎也分配了调用堆检查的新值，因此，可能解释了性能上的差异。我必须考虑如何以这样的方式编写尾递归，以便在迭代中不分配任何值，从而避免堆检查和分配开销。

score 10 · Accepted Answer

你受到了巨大的打击

U.forM_ (U.fromList [0..ct-1])

在findSnakes. 我确信这不应该发生（票？），但是Vector每次findSnakes调用时都会分配一个新的遍历。如果你使用

Control.Monad.forM_ [0 .. ct-1]

相反，运行时间大约减半，并且这里的分配减少了大约 500 倍。（GHC 优化C.M.forM_ [0 :: Int .. limit]得很好，列表被淘汰了，剩下的基本上就是一个循环了。）自己写循环可以做得稍微好一点。

一些导致无偿分配/代码大小膨胀而又不损害性能的事情是

的未使用Bool论点lcsh
和的cmp论点；如果从不使用与顶层不同的比较来调用它们，则该参数会导致不必要的代码重复。findSnakesgridWalkcmp
的一般类型while；将其专门用于使用的类型ST s Bool -> ST s () -> ST s ()减少了分配（很多），也减少了运行时间（稍微，但很明显，在这里）。

关于性能分析的一个通用词：为性能分析编译程序会抑制许多优化。特别是对于像 fusion 这样的库vector，bytestring或者text大量使用 fusion 的库，分析通常会产生误导性的结果。

例如，您的原始代码在此处生成

    total time  =        3.42 secs   (3415 ticks @ 1000 us, 1 processor)
    total alloc = 4,612,756,880 bytes  (excludes profiling overheads)

COST CENTRE MODULE    %time %alloc  ticks     bytes

gridWalk    Main       63.7   52.7   2176 2432608000
findSnakes  Main       20.0   27.8    682 1281440080
cmp         Main        9.2    0.0    313        16
findYP      Main        4.2   19.4    144 896224000
updateFP    Main        2.7    0.0     91         0

len只是在 in的绑定上添加一个 bang 在gridWalk非分析版本中根本没有改变，但对于分析版本

    total time  =        2.98 secs   (2985 ticks @ 1000 us, 1 processor)
    total alloc = 3,204,404,880 bytes  (excludes profiling overheads)

COST CENTRE MODULE    %time %alloc  ticks     bytes

gridWalk    Main       63.0   32.0   1881 1024256000
findSnakes  Main       22.2   40.0    663 1281440080
cmp         Main        7.2    0.0    214        16
findYP      Main        4.7   28.0    140 896224000
updateFP    Main        2.7    0.0     82         0

它有很大的不同。对于包含上述更改的版本（以及 bang on len）gridWalk，分析版本说

total alloc = 1,923,412,776 bytes  (excludes profiling overheads)

但非分析版本

     1,814,424 bytes allocated in the heap
        10,808 bytes copied during GC
        49,064 bytes maximum residency (2 sample(s))
        25,912 bytes maximum slop
             1 MB total memory in use (0 MB lost due to fragmentation)

                                  Tot time (elapsed)  Avg pause  Max pause
Gen  0         2 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
Gen  1         2 colls,     0 par    0.00s    0.00s     0.0001s    0.0001s

INIT    time    0.00s  (  0.00s elapsed)
MUT     time    0.12s  (  0.12s elapsed)
GC      time    0.00s  (  0.00s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time    0.12s  (  0.12s elapsed)

说它分配的资源比分析版本少 1000 倍。

对于vector和朋友的代码，识别瓶颈比分析更可靠（不幸的是，它也更耗时和困难）正在研究生成的核心（或程序集，如果你精通阅读的话）。

关于更新，C 在我的盒子上运行得有点慢（gcc-4.7.2，-O3）

$ time ./miltest1

real    0m0.074s
user    0m0.073s
sys     0m0.001s

但是 Haskell 差不多

$ time ./hsmiller
1

real    0m0.151s
user    0m0.149s
sys     0m0.001s

通过 LLVM 后端编译时，速度会快一些：

$ time ./hsmiller1

real    0m0.131s
user    0m0.129s
sys     0m0.001s

当我们用forM_手动循环替换

findSnakes a b fp !k !ct op = go 0
  where
    go x
        | x < ct    = gridWalk a b fp (op k x) >> go (x+1)
        | otherwise = return ()

它变得有点快，

$ time ./hsmiller
1

real    0m0.124s
user    0m0.121s
sys     0m0.002s

分别通过 LLVM：

$ time ./hsmiller
1

real    0m0.108s
user    0m0.107s
sys     0m0.000s

总的来说，生成的核心看起来不错，一个小烦恼是

Main.$wa
  :: forall s.
     GHC.Prim.Int#
     -> GHC.Types.Int
     -> GHC.Prim.State# s
     -> (# GHC.Prim.State# s, Main.MVI1 s #)

和一个稍微迂回的实现。这是通过newVI1在第二个参数中设置严格来解决的，

newVI1 n !x = do

由于不经常调用它，因此对性能的影响当然可以忽略不计。

肉是核心lcsh，看起来还不错。唯一被装箱的东西是Int从 / 写入到 s的 s STRef，这是不可避免的。不太令人愉快的是，核心包含大量重复代码，但根据我的经验，这很少是真正的性能问题，而且并非所有重复代码都能在代码生成中幸存下来。

为了让它表现良好，lcsh应该将whileM_循环减少为条件检查和内联findSnakes代码的一个很好的迭代内部循环。

当您向中添加INLINEpragma时，您会得到一个内部循环whileM_，但该循环并不好，而且在这种情况下，它比whileM_外联慢得多（我不确定这是否仅仅是由于代码大小，但它可能是）。

performance - 调试最长公共子序列算法的性能瓶颈

1 回答 1

Related

Reference