我编写了两个函数来从未知长度的列表中选择一个随机元素。第一个使用水库采样(水库大小为 1),第二个获取列表的长度以选择一个随机索引并返回它。出于某种原因,前者要快得多。
第一个函数使用单次遍历并以概率 (1/i) 选取每个元素,其中 i 是列表中元素的索引。它导致选择每个元素的概率相等。
pickRandom :: [a] -> IO a
pickRandom [] = error "List is empty"
pickRandom (x:xs) = do
stdgen <- newStdGen
return (pickRandom' xs x 1 stdgen)
-- Pick a random number using reservoir sampling
pickRandom' :: (RandomGen g) => [a] -> a -> Int -> g -> a
pickRandom' [] xi _ _ = xi
pickRandom' (x:xs) xi n gen =
let (rand, gen') = randomR (0, n) gen in
if (rand == 0) then
pickRandom' xs x (n + 1) gen' -- Update value
else
pickRandom' xs xi (n + 1) gen' -- Keep previous value
第二个版本遍历列表一次以获得它的长度,然后选择一个介于 0 和输入列表长度 (-1) 之间的索引来获得元素之一,再次以相等的概率。列表1.5的预期遍历次数:
-- Traverses the list twice
pickRandomWithLen :: [a] -> IO a
pickRandomWithLen [] = error "List is empty"
pickRandomWithLen xs = do
gen <- newStdGen
(e, _) <- return $ randomR (0, (length xs) - 1) gen
return $ xs !! e
这是我用于对这两个函数进行基准测试的代码:
main :: IO ()
main = do
gen <- newStdGen
let size = 2097152
inputList = getRandList gen size
defaultMain [ bench "Using length" (pickRandomWithLen inputList)
, bench "Using reservoir" (pickRandom inputList)
]
这是一个剥离的输出:
benchmarking Using reservoir
mean: 82.72108 ns, lb 82.02459 ns, ub 83.61931 ns, ci 0.950
benchmarking Using length
mean: 17.12571 ms, lb 16.97026 ms, ub 17.37352 ms, ci 0.950
换句话说,第一个函数比第二个函数快大约 200 倍。我预计运行时间主要受随机数生成和列表遍历次数(1 vs. 1.5)的影响。还有什么其他因素可以解释如此巨大的差异?