haskell - Repa 中的转置和累积和性能不佳

Question

我在 Haskell 库 Repa 中开发了一个如下定义的累积求和函数。但是，当将此功能与转置操作结合使用时，我遇到了一个问题。以下所有 3 项操作都需要不到一秒钟的时间：

cumsum $ cumsum $ cumsum x
transpose $ transpose $ transpose x
transpose $ cumsum x

但是，如果我写：

cumsum $ transpose x

性能严重下降。虽然在 1920x1080 的图像上，单独的每个操作都需要不到一秒的时间，但现在组合起来需要 30 多秒......

关于可能导致这种情况的任何想法？我的直觉告诉我这与延迟数组有关，不是在正确的时间强制等等……但我还没有足够的经验来追踪这个问题。

{-# LANGUAGE TypeOperators, FlexibleContexts, TypeFamilies #-}

import Data.Array.Repa as Repa

{-# INLINE indexSlice #-}
indexSlice :: (Shape sh, Elt a) => Int -> Array (sh :. Int) a -> (sh :. Int) -> a
indexSlice from arr (z :. ix) = arr `unsafeIndex` (z :. (ix + from))

{-# INLINE sliceRange #-}
sliceRange :: (Slice sh, Shape sh, Elt a) => Int -> Int -> Array (sh :. Int) a -> Array (sh :. Int) a
sliceRange from to arr = fromFunction (z :. (to - from + 1)) $ indexSlice from arr
    where (z :. _) = extent arr

{-# INLINE cumsum' #-}
cumsum' :: (Slice (SliceShape sh), Slice sh, Shape (FullShape sh), Shape (SliceShape sh), Elt a, Num a) =>
     Array (FullShape sh :. Int) a -> t -> (sh :. Int) -> a
cumsum' arr f (sh :. outer) = Repa.sumAll $ sliceRange 0 outer $ Repa.slice arr (sh :. All)

{-# INLINE cumsum #-}
cumsum :: (FullShape sh ~ sh, Slice sh, Slice (SliceShape sh), Shape sh, Shape (SliceShape sh), Elt a, Num a) =>
    Array (sh :. Int) a -> Array (sh :. Int) a
cumsum arr = Repa.force $ unsafeTraverse arr id $ cumsum' arr

score 25 · Accepted Answer

从库实现者的角度来看，调试的方法是为可疑操作创建一个包装器，然后查看核心代码以查看融合是否有效。

-- Main.hs ---------------------------------------------------
import Solver
import Data.Array.Repa.IO.BMP

main 
 = do   Right img       <- readImageFromBMP "whatever.bmp"
        print $ cumsumBMP img

-- Solver.hs --------------------------------------------------
{-# LANGUAGE TypeOperators, FlexibleContexts, TypeFamilies #-}
module Solver (cumsumBMP) where
import Data.Array.Repa  as Repa
import Data.Word

{- all your defs -}

{-# NOINLINE cumsumBMP #-}
cumsumBMP :: Array DIM3 Word8 -> Array DIM3 Word8
cumsumBMP img = cumsum $ transpose img

我已经将“求解器”代码放在一个单独的模块中，所以我们只需要浏览我们关心的定义的核心代码。

编译如下：

touch Solver.hs ; ghc -O2 --make Main.hs \
 -ddump-simpl -dsuppress-module-prefixes -dsuppress-coercions  > dump

转到关键字的定义cumsumBMP并搜索letrec。搜索letrec是查找内部循环的快速方法。

不远处，我看到了这个：（稍微重新格式化）

case gen_a1tr
of _ {
  GenManifest vec_a1tv ->
    case sh2_a1tc  `cast` ... of _ { :. sh3_a1iu  sh4_a1iv ->
    case ix'_a1t9  `cast` ... of _ { :. sh1'_a1iz sh2'_a1iA ->
    case sh3_a1iu  `cast` ... of _ { :. sh5_X1n0  sh6_X1n2 ->
    case sh1'_a1iz `cast` ... of _ { :. sh1'1_X1n9 sh2'1_X1nb ->
    case sh5_X1n0             of _ { :. sh7_X1n8   sh8_X1na ->
    ...
    case sh2'1_X1nb           of _ { I# y3_X1nO ->
    case sh4_a1iv             of _ { I# y4_X1nP ->
    case sh2'_a1iA            of _ { I# y5_X1nX ->
    ...
    let { x3_a1x6 :: Int# [LclId]
      x3_a1x6 =
        +#
          (*#
             (+#
                (*#
                   y1_a1iM
                   y2_X1nG)
                y3_X1nO)
             y4_X1nP)
          y5_X1nX } in
    case >=#
           x3_a1x6
           0
    of ...

灾难！绑定显然做了一些有用的x3_a1x6工作（乘法、加法等），但它包含在一系列拆箱操作中，这些操作也为每次循环迭代执行。更糟糕的是，它在每次迭代时都会对数组的长度和宽度（形状）进行拆箱，并且这些信息将始终相同。GHC 确实应该将这些 case 表达式从循环中浮出，但现在还没有。这是GHC trac 上的问题 #4081 的一个实例，希望很快会得到修复。

解决方法是应用于deepSeqArray传入数组。这就要求它在顶层（循环外）的价值，这让 GHC 知道可以将案例匹配进一步向上移动。对于像这样的函数cumsumBMP，我们还期望传入的数组已经是清单，因此我们可以为此添加显式大小写匹配：

{-# NOINLINE cumsumBMP #-}
cumsumBMP :: Array DIM3 Word8 -> Array DIM3 Word8
cumsumBMP img@(Array _ [Region RangeAll (GenManifest _)])
  = img `deepSeqArray` cumsum $ transpose img

再次编译，内部循环现在看起来好多了：

letrec {
$s$wfoldlM'_loop_s2mW [...]
  :: Int# -> Word# -> Word# [...]
$s$wfoldlM'_loop_s2mW =
  \ (sc_s2mA :: Int#) (sc1_s2mB :: Word#) ->
    case <=# sc_s2mA a_s2ji of _ {
      False -> sc1_s2mB;
      True ->
        $s$wfoldlM'_loop_s2mW
          (+# sc_s2mA 1)
          (narrow8Word#
             (plusWord#
                sc1_s2mB
                (indexWord8Array#
                   rb3_a2gZ
                   (+#
                      rb1_a2gX
                      (+#
                         (*#
                            (+#
                               (*#
                                  wild19_X1zO
                                  ipv1_X1m5)
                               sc_s2mA)
                            ipv2_X1m0)
                         wild20_X1Ct)))))
    }; } in

这是一个紧密的尾递归循环，只使用原始操作。如果您使用进行编译-fllvm -optlo-O3，那么没有理由运行速度不如等效的 C 程序。

运行时有一点小问题：

desire:tmp benl$ ./Main 
Main: Solver.hs:(50,1)-(51,45): Non-exhaustive patterns in function cumsumBMP

这只是提醒我们，我们需要在调用之前强制数组cumsumBMP。

-- Main.hs ---------------------------------------------------
...
import Data.Array.Repa as Repa
main 
 = do   Right img       <- readImageFromBMP "whatever.bmp"
        print $ cumsumBMP $ Repa.force img

总之：

您需要在deepSeqArray顶级函数中添加一些和模式匹配的 goop，以解决 GHC 中当前的不合理问题。cumsumBMP上述函数的最终版本证明了这一点。如果您希望 GHC HQ 尽快解决此问题，请将您自己作为 cc 添加到GHC trac 上的 Issue #4081。修复此问题后，Repa 程序会更漂亮。
您不需要将 goop 添加到每个函数中。在这个例子中，我不需要接触indexSlice和朋友。一般规则是将 goop 添加到使用force,fold或的函数中sumAll。这些函数实例化了对数组数据进行操作的实际循环，也就是说，它们将延迟数组转换为清单值。
一段 Repa 代码的性能取决于它与实际代码的使用环境。如果您传递您的顶级函数延迟数组，那么它们将运行非常缓慢。The Repa Tutorial对此有更多的讨论。
使用 repa-io 库读取的 BMP 文件不是预先强制的，因此您需要在使用前强制它们。这可能是错误的默认设置，所以我将在下一个版本中更改它。

haskell - Repa 中的转置和累积和性能不佳

1 回答 1

Related

Reference