IMO,Par
单子有助于推理并行性。它比处理par
and高一点pseq
。
parfindDivisors
这是使用Par
monad的重写。请注意,这与您的算法基本相同:
import Control.Monad.Par
findDivisors :: Integer -> [Integer]
findDivisors n = runPar $ do
[f0, f1] <- sequence [new, new]
fork $ put f0 (filter g [1..(quot n 4)])
fork $ put f1 (filter g [(quot n 4)+1..(quot n 2)])
[f0', f1'] <- sequence [get f0, get f1]
return $ f0' ++ f1'
where g z = n `rem` z == 0
编译它-O2 -threaded -rtsopts -eventlog
并运行 with+RTS -N2 -s
会产生以下相关的运行时统计信息:
36,000,130,784 bytes allocated in the heap
3,165,440 bytes copied during GC
48,464 bytes maximum residency (1 sample(s))
Tot time (elapsed) Avg pause Max pause
Gen 0 35162 colls, 35161 par 0.39s 0.32s 0.0000s 0.0006s
Gen 1 1 colls, 1 par 0.00s 0.00s 0.0002s 0.0002s
Parallel GC work balance: 1.32 (205296 / 155521, ideal 2)
MUT time 42.68s ( 21.48s elapsed)
GC time 0.39s ( 0.32s elapsed)
Total time 43.07s ( 21.80s elapsed)
Alloc rate 843,407,880 bytes per MUT second
Productivity 99.1% of total user, 195.8% of total elapsed
生产力非常高。为了稍微改善 GC 工作平衡,我们可以增加 GC 分配区域大小;运行+RTS -N2 -s -A128M
,例如:
36,000,131,336 bytes allocated in the heap
47,088 bytes copied during GC
49,808 bytes maximum residency (1 sample(s))
Tot time (elapsed) Avg pause Max pause
Gen 0 135 colls, 134 par 0.19s 0.10s 0.0007s 0.0009s
Gen 1 1 colls, 1 par 0.00s 0.00s 0.0010s 0.0010s
Parallel GC work balance: 1.62 (2918 / 1801, ideal 2)
MUT time 42.65s ( 21.49s elapsed)
GC time 0.20s ( 0.10s elapsed)
Total time 42.85s ( 21.59s elapsed)
Alloc rate 843,925,806 bytes per MUT second
Productivity 99.5% of total user, 197.5% of total elapsed
但这真的只是吹毛求疵。真实的故事来自 ThreadScope:
两个核心的利用率基本上达到了最大值,因此可能不会发生额外的显着并行化(对于两个核心)。
这里Par
有一些关于monad 的好笔记。
更新
替代算法的重写Par
看起来像这样:
findDivisors :: Integer -> [Integer]
findDivisors n = let sqrtn = floor (sqrt (fromInteger n)) in runPar $ do
[a, b] <- sequence [new, new]
fork $ put a [a | (a, b) <- [quotRem n x | x <- [1..sqrtn]], b == 0]
firstDivs <- get a
fork $ put b [n `quot` x | x <- firstDivs, x /= sqrtn]
secondDivs <- get b
return $ firstDivs ++ secondDivs
但是你是对的,由于依赖于firstDivs
.
Strategies
通过参与并行评估列表推导的元素,您仍然可以在此处合并并行性。就像是:
import Control.Monad.Par
import Control.Parallel.Strategies
findDivisors :: Integer -> [Integer]
findDivisors n = let sqrtn = floor (sqrt (fromInteger n)) in runPar $ do
[a, b] <- sequence [new, new]
fork $ put a
([a | (a, b) <- [quotRem n x | x <- [1..sqrtn]], b == 0] `using` parListChunk 2 rdeepseq)
firstDivs <- get a
fork $ put b
([n `quot` x | x <- firstDivs, x /= sqrtn] `using` parListChunk 2 rdeepseq)
secondDivs <- get b
return $ firstDivs ++ secondDivs
运行它会给出一些统计数据,例如
3,388,800 bytes allocated in the heap
43,656 bytes copied during GC
68,032 bytes maximum residency (1 sample(s))
Tot time (elapsed) Avg pause Max pause
Gen 0 5 colls, 4 par 0.00s 0.00s 0.0000s 0.0001s
Gen 1 1 colls, 1 par 0.00s 0.00s 0.0002s 0.0002s
Parallel GC work balance: 1.22 (2800 / 2290, ideal 2)
MUT time (elapsed) GC time (elapsed)
Task 0 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s)
Task 1 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s)
Task 2 (bound) : 0.01s ( 0.01s) 0.00s ( 0.00s)
Task 3 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s)
SPARKS: 50 (49 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)
MUT time 0.01s ( 0.00s elapsed)
GC time 0.00s ( 0.00s elapsed)
Total time 0.01s ( 0.01s elapsed)
Alloc rate 501,672,834 bytes per MUT second
Productivity 85.0% of total user, 95.2% of total elapsed
在这里转换了近 50 个火花——也就是说,正在进行有意义的并行工作——但计算量不足以观察并行性带来的任何挂钟增益。任何收益都可能被线程运行时调度计算的开销所抵消。