performance - 如何优化可以完全严格的循环

Question

我正在尝试为Project Euler Problem #145编写一个蛮力解决方案，但我无法让我的解决方案在不到 1 分 30 秒内运行。

（我知道有各种捷径，甚至是纸笔解决方案；出于这个问题的目的，我不考虑这些）。

在我迄今为止提出的最佳版本中，分析表明大部分时间都花在foldDigits. 这个函数根本不需要偷懒，在我看来应该优化为一个简单的循环。如您所见，我试图使程序的各个部分变得严格。

所以我的问题是：在不改变整体算法的情况下，有没有办法把这个程序的执行时间降低到亚分钟标记？

（或者如果没有，有没有办法看到代码foldDigits尽可能优化？）

-- ghc -O3 -threaded Euler-145.hs && Euler-145.exe +RTS -N4

{-# LANGUAGE BangPatterns #-}

import Control.Parallel.Strategies

foldDigits :: (a -> Int -> a) -> a -> Int -> a
foldDigits f !acc !n
    | n < 10    = i
    | otherwise = foldDigits f i d
  where (d, m) = n `quotRem` 10
        !i     = f acc m

reverseNumber :: Int -> Int
reverseNumber !n
    = foldDigits accumulate 0 n
  where accumulate !v !d = v * 10 + d

allDigitsOdd :: Int -> Bool
allDigitsOdd n
    = foldDigits andOdd True n
  where andOdd !a d = a && isOdd d
        isOdd !x    = x `rem` 2 /= 0

isReversible :: Int -> Bool
isReversible n
    = notDivisibleByTen n && allDigitsOdd (n + rn)
  where rn                   = reverseNumber n
        notDivisibleByTen !x = x `rem` 10 /= 0

countRange acc start end
    | start > end = acc
    | otherwise   = countRange (acc + v) (start + 1) end
  where v = if isReversible start then 1 else 0

main
    = print $ sum $ parMap rseq cr ranges
  where max       = 1000000000
        qmax      = max `div` 4
        ranges    = [(1, qmax), (qmax, qmax * 2), (qmax * 2, qmax * 3), (qmax * 3, max)]
        cr (s, e) = countRange 0 s e

score 8 · Accepted Answer

就目前而言，ghc-7.6.1 为foldDigits(with -O2) 生成的核心是

Rec {
$wfoldDigits_r2cK
  :: forall a_aha.
     (a_aha -> GHC.Types.Int -> a_aha)
     -> a_aha -> GHC.Prim.Int# -> a_aha
[GblId, Arity=3, Caf=NoCafRefs, Str=DmdType C(C(S))SL]
$wfoldDigits_r2cK =
  \ (@ a_aha)
    (w_s284 :: a_aha -> GHC.Types.Int -> a_aha)
    (w1_s285 :: a_aha)
    (ww_s288 :: GHC.Prim.Int#) ->
    case w1_s285 of acc_Xhi { __DEFAULT ->
    let {
      ds_sNo [Dmd=Just D(D(T)S)] :: (GHC.Types.Int, GHC.Types.Int)
      [LclId, Str=DmdType]
      ds_sNo =
        case GHC.Prim.quotRemInt# ww_s288 10
        of _ { (# ipv_aJA, ipv1_aJB #) ->
        (GHC.Types.I# ipv_aJA, GHC.Types.I# ipv1_aJB)
        } } in
    case w_s284 acc_Xhi (case ds_sNo of _ { (d_arS, m_Xsi) -> m_Xsi })
    of i_ahg { __DEFAULT ->
    case GHC.Prim.<# ww_s288 10 of _ {
      GHC.Types.False ->
        case ds_sNo of _ { (d_Xsi, m_Xs5) ->
        case d_Xsi of _ { GHC.Types.I# ww1_X28L ->
        $wfoldDigits_r2cK @ a_aha w_s284 i_ahg ww1_X28L
        }
        };
      GHC.Types.True -> i_ahg
    }
    }
    }
end Rec }

如您所见，它重新装箱了quotRem调用的结果。问题是f这里没有可用的属性，并且作为递归函数，foldDigits不能内联。

使用手动工人包装器转换使函数参数静态，

foldDigits :: (a -> Int -> a) -> a -> Int -> a
foldDigits f = go
  where
    go !acc 0 = acc
    go acc n = case n `quotRem` 10 of
                 (q,r) -> go (f acc r) q

foldDigits变得可内联，并且您会获得专门的版本供您对未装箱数据进行操作，但没有顶级foldDigits，例如

Rec {
$wgo_r2di :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=2, Caf=NoCafRefs, Str=DmdType LL]
$wgo_r2di =
  \ (ww_s28F :: GHC.Prim.Int#) (ww1_s28J :: GHC.Prim.Int#) ->
    case ww1_s28J of ds_XJh {
      __DEFAULT ->
        case GHC.Prim.quotRemInt# ds_XJh 10
        of _ { (# ipv_aJK, ipv1_aJL #) ->
        $wgo_r2di (GHC.Prim.+# (GHC.Prim.*# ww_s28F 10) ipv1_aJL) ipv_aJK
        };
      0 -> ww_s28F
    }
end Rec }

并且对计算时间的影响是有形的，对于原始的，我得到了

$ ./eul145 +RTS -s -N2
608720
1,814,289,579,592 bytes allocated in the heap
     196,407,088 bytes copied during GC
          47,184 bytes maximum residency (2 sample(s))
          30,640 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     1827331 colls, 1827331 par   23.77s   11.86s     0.0000s    0.0041s
  Gen  1         2 colls,     1 par    0.00s    0.00s     0.0001s    0.0001s

  Parallel GC work balance: 54.94% (serial 0%, perfect 100%)

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)

  SPARKS: 4 (3 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time  620.52s  (313.51s elapsed)
  GC      time   23.77s  ( 11.86s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time  644.29s  (325.37s elapsed)

  Alloc rate    2,923,834,808 bytes per MUT second

（-N2因为我的 i5 只有两个物理内核，所以我使用了）。

$ ./eul145 +RTS -s -N2
608720
  16,000,063,624 bytes allocated in the heap
         403,384 bytes copied during GC
          47,184 bytes maximum residency (2 sample(s))
          30,640 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     15852 colls, 15852 par    0.34s    0.17s     0.0000s    0.0037s
  Gen  1         2 colls,     1 par    0.00s    0.00s     0.0001s    0.0001s

  Parallel GC work balance: 43.86% (serial 0%, perfect 100%)

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)

  SPARKS: 4 (3 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time  314.85s  (160.08s elapsed)
  GC      time    0.34s  (  0.17s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time  315.20s  (160.25s elapsed)

  Alloc rate    50,817,657 bytes per MUT second

  Productivity  99.9% of total user, 196.5% of total elapsed

与修改。运行时间大致减半，分配减少了 100 倍。

performance - 如何优化可以完全严格的循环

1 回答 1

Related

Reference