I'm trying to wrap my head around parallel strategies. I think I understand what each of the combinators does, but every time I try using them with more than 1 core, the program slows down considerably.

For example, a while back I tried computing histograms (and, from those, the unique words) from about 700 documents. I figured that file-level granularity would be fine. With -N4 I get a work balance of 1.70. However, it runs faster with -N1 than with -N4. I'm not sure what the question really is, but I'd like to know how to decide where/when/how to parallelize and to gain some understanding of it. How would this be parallelized so that performance increases with the number of cores rather than decreases?

import Data.Map (Map)
import qualified Data.Map as M
import System.Directory
import Control.Applicative
import Data.Vector (Vector)
import qualified Data.Vector as V
import qualified Data.Text as T
import qualified Data.Text.IO as TI
import Data.Text (Text)
import System.FilePath ((</>))
import Control.Parallel.Strategies
import qualified Data.Set as S
import Data.Set (Set)
import GHC.Conc (pseq, numCapabilities)
import Data.List (foldl')

-- Map in parallel via parMap; pseq ensures the mapped results are
-- forced before the (sequential) reduce starts.
mapReduce :: Strategy b -> (a -> b) -> Strategy c -> ([b] -> c) -> [a] -> c
mapReduce stratm m stratr r xs = let
  mapped = parMap stratm m xs
  reduced = r mapped `using` stratr
  in mapped `pseq` reduced

type Histogram = Map Text Int

rootDir = "/home/masse/Documents/text_conversion/"

finnishStop = ["minä", "sinä", "hän", "kuitenkin", "jälkeen", "mukaanlukien", "koska", "mutta", "jos", "kuitenkin", "kun", "kunnes", "sanoo", "sanoi", "sanoa", "miksi", "vielä", "sinun"]
englishStop = ["a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your"]
isStopWord :: Text -> Bool
isStopWord x = x `elem` (finnishStop ++ englishStop)

textFiles :: IO [FilePath]
textFiles = map (rootDir </>) . filter (not . meta) <$> getDirectoryContents rootDir
  where meta "." = True
        meta ".." = True
        meta _ = False

histogram :: Text -> Histogram
histogram = foldr (\k -> M.insertWith' (+) k 1) M.empty . filter (not . isStopWord) . T.words

wordList = do
  files <- mapM TI.readFile =<< textFiles
  return $ mapReduce rseq histogram rseq reduce files
  where
    reduce = M.unions

main = do
  list <- wordList
  print $ M.size list

As for the text files, I'm using PDFs converted to text files, so I can't provide them, but for this purpose almost any book(s) from Project Gutenberg should do.

Edit: added imports to the script

2 Answers

I think Daniel got it right: Data.Map and lists are lazy data structures; you should use both foldl' and insertWith' to ensure that the work for each chunk is done eagerly, otherwise all of the work is delayed until the sequential part (the reduce).
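A minimal sketch of that change to the histogram function from the question (same types; only the fold direction and strictness change):

histogram :: Text -> Histogram
histogram = foldl' step M.empty . filter (not . isStopWord) . T.words
  where
    -- foldl' forces the accumulator map at every step, and insertWith'
    -- forces the updated count, so each file's histogram is built
    -- eagerly inside its own spark rather than as a chain of thunks
    -- handed to the sequential reduce.
    step m k = M.insertWith' (+) k 1 m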

Similarly, it is not obvious that making one spark per file is the right granularity, particularly if the file sizes differ substantially. If that is the case, it would be preferable to concatenate the word lists and split them into even-sized chunks (see the parListChunk combinator).
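A sketch of that idea; the chunk size and the chunksOf helper are illustrative, not from the original post (parListChunk n rseq over the per-chunk results would similarly group several sparks together if even this granularity turned out too fine):

-- One spark per even-sized chunk of the concatenated word list,
-- so the work balance no longer depends on how file sizes vary.
histogramChunked :: [Text] -> Histogram
histogramChunked texts = M.unions (map chunkHist chunks `using` parList rseq)
  where
    ws        = filter (not . isStopWord) (concatMap T.words texts)
    chunks    = chunksOf 10000 ws  -- chunk size is a tuning knob
    chunkHist = foldl' (\m k -> M.insertWith' (+) k 1 m) M.empty

-- As in Data.List.Split, defined here to stay self-contained:
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = take n xs : chunksOf n (drop n xs)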

While you're at it, I would also look into some of the pitfalls of opening many files with lazy IO (readFile): the runtime system can run out of file handles because it holds onto them for too long.
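The question's TI.readFile (strict Data.Text.IO) already avoids this; the pitfall applies to lazy Prelude.readFile and Data.Text.Lazy.IO.readFile. A sketch of an explicit-handle version that makes the handle lifetime obvious (the helper name is mine):

import System.IO (IOMode (ReadMode), withFile)

-- TI.hGetContents is strict, so the whole file is read before
-- withFile closes the handle; no lazy thunk keeps it open.
readTextStrict :: FilePath -> IO Text
readTextStrict path = withFile path ReadMode TI.hGetContents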

Answered 2013-01-31T14:02:31.630
Getting the parallel combinators to scale well in practice can be difficult. Others have mentioned making your code more strict to ensure that you are actually doing the work in parallel, and that is definitely important.

Two things that can really kill performance are lots of memory traversal and garbage collection. Even if you are not producing a lot of garbage, lots of memory traversals put more pressure on the CPU cache, and eventually your memory bus becomes the bottleneck. Your isStopWord function performs a lot of string comparisons and has to traverse a rather long linked list to do so. You can save a lot of work by using the built-in Set type or, better still, the HashSet type from the unordered-containers package (repeated string comparisons can get expensive, especially when strings share common prefixes).

import           Data.HashSet                (HashSet)
import qualified Data.HashSet                as S

...

finnishStop :: [Text]
finnishStop = ["minä", "sinä", "hän", "kuitenkin", "jälkeen", "mukaanlukien", "koska", "mutta", "jos", "kuitenkin", "kun", "kunnes", "sanoo", "sanoi", "sanoa", "miksi", "vielä", "sinun"]
englishStop :: [Text]
englishStop = ["a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your"]

stopWord :: HashSet Text
stopWord = S.fromList (finnishStop ++ englishStop)

isStopWord :: Text -> Bool
isStopWord x = x `S.member` stopWord

Replacing your isStopWord function with this version performs better and scales better (though definitely not 1-1). You could also consider using a HashMap (from the same package) instead of a Map for the same reasons, but doing so did not produce a noticeable change.
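For reference, a sketch of that HashMap variant, using the strict Data.HashMap.Strict API (this code is not from the original answer):

import           Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HM

type Histogram = HashMap Text Int

-- Same shape as before: a strict left fold with a strict insertWith,
-- so counts are forced as the map is built. The reduce step becomes
-- HM.unions instead of M.unions.
histogram :: Text -> Histogram
histogram = foldl' (\m k -> HM.insertWith (+) k 1 m) HM.empty
          . filter (not . isStopWord) . T.words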

Another option is to increase the default heap size to take some pressure off the GC and give it more room to move things around. Giving the compiled code a default heap size of 1 GB (the -H1G flag), I got a GC balance of about 50% on 4 cores, whereas without it I only got ~25% (and it also runs ~30% faster).
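Assuming the program lives in a file called histogram.hs (the name is illustrative), the corresponding build and run lines would be:

ghc -O2 -threaded -rtsopts histogram.hs
./histogram +RTS -N4 -H1G -s

The -s RTS flag prints the GC statistics (productivity, work balance) that the numbers above refer to.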

With these two changes, the average runtime on four cores (on my machine) drops from ~10.5s to ~3.5s. Arguably there is still room for improvement based on the GC statistics (it still only spends 58% of its time doing productive work), but doing significantly better might require a much more drastic change to the algorithm.

Answered 2013-01-31T16:34:10.453