4

我使用 Parallel Haskell 编写了以下程序来找到 10 亿的除数。

import Control.Parallel

parfindDivisors :: Integer->[Integer]
parfindDivisors n = f1 `par` (f2 `par` (f1 ++ f2))
              where f1=filter g [1..(quot n 4)]
                    f2=filter g [(quot n 4)+1..(quot n 2)]
                    g z = n `rem` z == 0

main = print (parfindDivisors 1000000000)

我已经编译了程序ghc -rtsopts -threaded findDivisors.hs并运行它: findDivisors.exe +RTS -s -N2 -RTS

与以下简单版本相比,我发现了 50% 的加速:

findDivisors :: Integer->[Integer]
findDivisors n = filter g [1..(quot n 2)] 
      where  g z = n `rem` z == 0

我的处理器是 Intel 的双核 2 duo。我想知道上面的代码是否有任何改进。因为在程序打印的统计数据中说: Parallel GC work balance: 1.01 (16940708 / 16772868, ideal 2) 以及SPARKS: 2 (1 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled) 这些是什么转换溢出GC'd失败以及如何帮助改善时间。

4

3 回答 3

2

IMO,Par单子有助于推理并行性。它比处理parand高一点pseq

parfindDivisors这是使用Parmonad的重写。请注意,这与您的算法基本相同:

import Control.Monad.Par

findDivisors :: Integer -> [Integer]
findDivisors n = runPar $ do
    [f0, f1] <- sequence [new, new]
    fork $ put f0 (filter g [1..(quot n 4)])
    fork $ put f1 (filter g [(quot n 4)+1..(quot n 2)])
    [f0', f1'] <- sequence [get f0, get f1]
    return $ f0' ++ f1'
  where g z  = n `rem` z == 0

编译它-O2 -threaded -rtsopts -eventlog并运行 with+RTS -N2 -s会产生以下相关的运行时统计信息:

  36,000,130,784 bytes allocated in the heap
       3,165,440 bytes copied during GC
          48,464 bytes maximum residency (1 sample(s))

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     35162 colls, 35161 par    0.39s    0.32s     0.0000s    0.0006s
  Gen  1         1 colls,     1 par    0.00s    0.00s     0.0002s    0.0002s

  Parallel GC work balance: 1.32 (205296 / 155521, ideal 2)

  MUT     time   42.68s  ( 21.48s elapsed)
  GC      time    0.39s  (  0.32s elapsed)
  Total   time   43.07s  ( 21.80s elapsed)

  Alloc rate    843,407,880 bytes per MUT second

  Productivity  99.1% of total user, 195.8% of total elapsed

生产力非常高。为了稍微改善 GC 工作平衡,我们可以增加 GC 分配区域大小;运行+RTS -N2 -s -A128M,例如:

  36,000,131,336 bytes allocated in the heap
          47,088 bytes copied during GC
          49,808 bytes maximum residency (1 sample(s))

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       135 colls,   134 par    0.19s    0.10s     0.0007s    0.0009s
  Gen  1         1 colls,     1 par    0.00s    0.00s     0.0010s    0.0010s

  Parallel GC work balance: 1.62 (2918 / 1801, ideal 2)

  MUT     time   42.65s  ( 21.49s elapsed)
  GC      time    0.20s  (  0.10s elapsed)
  Total   time   42.85s  ( 21.59s elapsed)

  Alloc rate    843,925,806 bytes per MUT second

  Productivity  99.5% of total user, 197.5% of total elapsed

但这真的只是吹毛求疵。真实的故事来自 ThreadScope:

大量利用

两个核心的利用率基本上达到了最大值,因此可能不会发生额外的显着并行化(对于两个核心)。

这里Par有一些关于monad 的好笔记。

更新

替代算法的重写Par看起来像这样:

findDivisors ::  Integer -> [Integer]
findDivisors n = let sqrtn = floor (sqrt (fromInteger n)) in runPar $ do
    [a, b] <- sequence [new, new]
    fork $ put a [a | (a, b) <- [quotRem n x | x <- [1..sqrtn]], b == 0]
    firstDivs  <- get a
    fork $ put b [n `quot` x | x <- firstDivs, x /= sqrtn]
    secondDivs <- get b
    return $ firstDivs ++ secondDivs

但是你是对的,由于依赖于firstDivs.

Strategies通过参与并行评估列表推导的元素,您仍然可以在此处合并并行性。就像是:

import Control.Monad.Par
import Control.Parallel.Strategies

findDivisors ::  Integer -> [Integer]
findDivisors n = let sqrtn = floor (sqrt (fromInteger n)) in runPar $ do
    [a, b] <- sequence [new, new]
    fork $ put a 
        ([a | (a, b) <- [quotRem n x | x <- [1..sqrtn]], b == 0] `using` parListChunk 2 rdeepseq)
    firstDivs  <- get a
    fork $ put b 
        ([n `quot` x | x <- firstDivs, x /= sqrtn] `using` parListChunk 2 rdeepseq)
    secondDivs <- get b
    return $ firstDivs ++ secondDivs

运行它会给出一些统计数据,例如

       3,388,800 bytes allocated in the heap
          43,656 bytes copied during GC
          68,032 bytes maximum residency (1 sample(s))

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0         5 colls,     4 par    0.00s    0.00s     0.0000s    0.0001s
  Gen  1         1 colls,     1 par    0.00s    0.00s     0.0002s    0.0002s

  Parallel GC work balance: 1.22 (2800 / 2290, ideal 2)

                        MUT time (elapsed)       GC time  (elapsed)
  Task  0 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task  1 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task  2 (bound)  :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task  3 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)

  SPARKS: 50 (49 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)

  MUT     time    0.01s  (  0.00s elapsed)
  GC      time    0.00s  (  0.00s elapsed)
  Total   time    0.01s  (  0.01s elapsed)

  Alloc rate    501,672,834 bytes per MUT second

  Productivity  85.0% of total user, 95.2% of total elapsed

在这里转换了近 50 个火花——也就是说,正在进行有意义的并行工作——但计算量不足以观察并行性带来的任何挂钟增益。任何收益都可能被线程运行时调度计算的开销所抵消。

于 2012-09-18T18:10:34.773 回答
1

我认为这个页面比我能更好地解释它:

http://www.haskell.org/haskellwiki/ThreadScope_Tour/SparkOverview

我还发现这些幻灯片很有趣:

http://haskellwiki.gitit.net/Upload/HIW2011-Talk-Coutts.pdf

于 2012-09-18T10:47:57.053 回答
0

我用以下内容修改原始代码我有更好的加速但是我认为这个代码不能并行化

findDivisors2 :: Integer->[Integer]
findDivisors2 n= let firstDivs=[a|(a,b)<-[quotRem n x|x<-[1..sqrtn]],b==0]
                 secondDivs=[n `quot` x|x<-firstDivs,x/=sqrtn]
                 sqrtn = floor(sqrt (fromInteger n))
               in firstDivs ++ secondDivs

我试图用这个来并行化代码:

parfindDivisors2 :: Integer->[Integer]
parfindDivisors2 n= let firstDivs=[a|(a,b)<-[quotRem n x|x<-[1..sqrtn]],b==0]
                 secondDivs=[n `quot` x|x<-firstDivs,x/=sqrtn]
                 sqrtn = floor(sqrt (fromInteger n))
               in  secondDivs `par` firstDivs++secondDivs

我没有减少时间,而是将时间增加了一倍。我认为发生这种情况是因为 findDivisors2 具有很强的数据依赖性,而第一个算法findDivisors则没有。

欢迎任何意见。

于 2012-09-18T20:28:55.590 回答