作为一个小练习,我在 haskell 中制作了以下单词计数程序。它计算文本文件中不同的单词,并输出 50 个最常见的单词及其频率。
import qualified Data.Map as Map
import Data.List.Split
import Data.List
import Data.Ord
-- Count words
count = Map.toList . foldl' increment Map.empty
where
increment dict k = Map.insert k (1 + Map.findWithDefault 0 k dict) dict
-- Sort the counts
countAndSort = sortBy (flip $ comparing snd) . count
-- Pretty printing
pp :: Show a => [(String,a)] -> IO()
pp = putStrLn . foldl' format "" where
format text (x,y) = text ++ "\n" ++ x ++ "\t" ++ show y
main = readFile "pg13951.txt" >>= pp . take 50 .countAndSort . splitOn " "
问题是它比我使用可变 dict 的 python 实现慢 16 倍:
def increment(dic,word):
dic[word] = dic.get(word,0) + 1
return dic
print sorted(reduce(increment,open("pg13951.txt").read().split(),{}).items(),key=lambda e:-e[1])[:50]
我认为问题是由于 ghc 不断地重新分配新地图,而它可以一遍又一遍地重用同一个地图。运行时统计显示了很多分配:
$ ghc -rtsopts count.hs
$ ./count +RTS -sstderr
de 7682
et 4423
la 4238
<snip>
d'Artagnan 511
M. 502
c'est 443
d'Artagnan, 443
705,888,048 bytes allocated in the heap
655,511,720 bytes copied during GC
139,823,800 bytes maximum residency (10 sample(s))
1,049,416 bytes maximum slop
287 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1366 colls, 0 par 2.16s 2.26s 0.0017s 0.0072s
Gen 1 10 colls, 0 par 2.86s 3.09s 0.3093s 1.5055s
INIT time 0.00s ( 0.00s elapsed)
MUT time 3.18s ( 3.36s elapsed)
GC time 5.02s ( 5.36s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 8.20s ( 8.72s elapsed)
%GC time 61.2% (61.4% elapsed)
Alloc rate 221,831,366 bytes per MUT second
Productivity 38.8% of total user, 36.5% of total elapsed
我的问题是:有没有办法让这个程序表现得更好,而无需使用诸如在 IO monad 中工作、使用可变数据结构等肮脏的技巧?
PS:数据文件可在以下网址获得: http: //www.gutenberg.org/cache/epub/13951/pg13951.txt