4

我在 Haskell 和 Scala 中有一段非常简单的代码。此代码旨在以非常紧密的循环运行,因此性能很重要。问题是 Haskell 比 Scala 慢大约 10 倍。这是 Haskell 代码。

{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Unboxed as VU

newtype AffineTransform = AffineTransform {get :: (VU.Vector Double)} deriving (Show)

{-# INLINE runAffineTransform #-}
runAffineTransform :: AffineTransform -> (Double, Double) -> (Double, Double)
runAffineTransform affTr (!x, !y) = (get affTr `VU.unsafeIndex` 0 * x + get affTr `VU.unsafeIndex` 1 * y + get affTr `VU.unsafeIndex` 2, 
                                      get affTr `VU.unsafeIndex` 3 * x + get affTr `VU.unsafeIndex` 4 * y + get affTr `VU.unsafeIndex` 5)

testAffineTransformSpeed :: AffineTransform -> Int -> (Double, Double)
testAffineTransformSpeed affTr count = go count (0.5, 0.5)
  where go :: Int -> (Double, Double) -> (Double, Double)
        go 0 res = res
        go !n !res = go (n-1) (runAffineTransform affTr res)

还可以做些什么来改进此代码?

4

2 回答 2

9

我定义了以下严格/未装箱对类型:

import System.Random.MWC -- for later
import Control.DeepSeq

data SP = SP {
    one :: {-# UNPACK #-} !Double
  , two :: {-# UNPACK #-} !Double
  } deriving Show

instance NFData SP where
  rnf p = rnf (one p) `seq` rnf (two p) `seq` ()

并在runAffineTransform函数中替换它:

runAffineTransform2 :: AffineTransform -> SP -> SP
runAffineTransform2 affTr !(SP x y) =
  SP (  get affTr `U.unsafeIndex` 0 * x
      + get affTr `U.unsafeIndex` 1 * y
      + get affTr `U.unsafeIndex` 2     )

     (  get affTr `U.unsafeIndex` 3 * x
      + get affTr `U.unsafeIndex` 4 * y
      + get affTr `U.unsafeIndex` 5     )
{-# INLINE runAffineTransform2 #-}

然后运行这个基准套件:

main :: IO ()
main = do
  g  <- create
  zs <- fmap (AffineTransform . U.fromList)
             (replicateM 100000 (uniformR (0 :: Double, 1) g))

  let myConfig = defaultConfig { cfgPerformGC = ljust True }

  defaultMainWith myConfig (return ()) [
      bench "yours" $ nf (testAffineTransformSpeed  zs) 10
    , bench "mine"  $ nf (testAffineTransformSpeed2 zs) 10
    ]

编译-O2并运行,并观察到一些(~4x)加速:

benchmarking yours
mean: 257.4559 ns, lb 256.2492 ns, ub 258.9761 ns, ci 0.950
std dev: 6.889905 ns, lb 5.688330 ns, ub 8.839753 ns, ci 0.950
found 5 outliers among 100 samples (5.0%)
  3 (3.0%) high mild
  2 (2.0%) high severe
variance introduced by outliers: 20.944%
variance is moderately inflated by outliers

benchmarking mine
mean: 69.56408 ns, lb 69.29910 ns, ub 69.86838 ns, ci 0.950
std dev: 1.448874 ns, lb 1.261444 ns, ub 1.718074 ns, ci 0.950
found 4 outliers among 100 samples (4.0%)
  4 (4.0%) high mild
variance introduced by outliers: 14.190%
variance is moderately inflated by outliers

完整的代码在这里

编辑

我还在这里发布了标准的输出报告。

于 2013-08-22T04:36:08.467 回答
8

主要问题是

runAffineTransform affTr (!x, !y) = (get affTr `VU.unsafeIndex` 0 * x
                                     + get affTr `VU.unsafeIndex` 1 * y
                                     + get affTr `VU.unsafeIndex` 2, 
                                       get affTr `VU.unsafeIndex` 3 * x
                                     + get affTr `VU.unsafeIndex` 4 * y
                                     + get affTr `VU.unsafeIndex` 5)

产生一对thunk。组件在runAffineTransform被调用时不会被评估,它们仍然是 thunk,直到某些消费者要求对它们进行评估。

testAffineTransformSpeed affTr count = go count (0.5, 0.5)
  where go :: Int -> (Double, Double) -> (Double, Double)
        go 0 res = res
        go !n !res = go (n-1) (runAffineTransform affTr res)

不是那个消费者,爆炸res只会将它评估为最外面的构造函数,(,)你会得到一个结果

runAffineTransform affTr (runAffineTrasform affTr (runAffineTransform affTr (...)))

仅在最后需要标准形式时才对其进行评估。

如果您强制立即评估结果的组成部分,

runAffineTransform affTr (!x, !y) = case
  (  get affTr `U.unsafeIndex` 0 * x
   + get affTr `U.unsafeIndex` 1 * y
   + get affTr `U.unsafeIndex` 2
  ,  get affTr `U.unsafeIndex` 3 * x
   + get affTr `U.unsafeIndex` 4 * y
   + get affTr `U.unsafeIndex` 5
  ) of (!a,!b) -> (a,b)

并让它内联,与jtobin使用自定义严格一对unboxedDouble# s 的版本的主要区别在于,对于循环,testAffineTransformSpeed您使用 boxed s 作为参数获得一个初始迭代Double,最后,结果的组件被装箱,这增加了一些恒定的开销(我的盒子上每个循环大约 5 纳秒)。循环的主要部分在这两种情况下都接受一个Int#和两个Double#参数,并且循环体是相同的,除了n = 0到达时的装箱。

当然,通过使用未装箱的严格对类型强制立即评估组件会更好。

于 2013-08-22T12:14:57.723 回答