performance - Cross-module optimization in GHC

Question

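I have a non-recursive function to calculate the longest common subsequence that seems to perform well (ghc 7.6.1, compiled with the flags -O2 -fllvm) if I measure it with Criterion in the same module. On the other hand, if I convert the function into a module, export just that function (as recommended here), and then measure again with Criterion, I get a ~2x slowdown (which goes away if I move the Criterion test back into the module where the function is defined). I tried marking the function with the INLINE pragma, which made no difference to the cross-module performance measurements.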

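It seems to me that GHC might be doing a strictness analysis that works well when the function and main (from which the function is reachable) are in the same module, but not when they are split across modules. I would appreciate pointers on how to modularize a function so that it performs well when called from other modules. The code in question is too big to paste here; you can see it here if you want to try it out. A small example of what I am trying to do is below (with code snippets):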

-- Function to find longest common subsequence given unboxed vectors a and b
-- It returns indices of LCS in a and b
lcs :: (U.Unbox a, Eq a) => Vector a -> Vector a -> (Vector Int,Vector Int)
lcs a b | (U.length a > U.length b) = lcsh b a True
        | otherwise = lcsh a b False

-- This section below measures performance of lcs function - if I move it to 
-- a different module, performance degrades ~2x - mean goes from ~1.25us to ~2.4us
-- on my test machine
{-- 
config :: Config
config = defaultConfig  { cfgSamples = ljust 100 }

a = U.fromList ['a'..'j'] :: Vector Char
b = U.fromList ['a'..'k'] :: Vector Char

suite :: [Benchmark]
suite = [
          bench "lcs 10" $ whnf (lcs a) b
        ]

main :: IO()
main = defaultMainWith config (return ()) suite
--}

score 14 · Accepted Answer

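hammar is right: the important issue is that the compiler can see the type at which lcs is used at the same time as it can see the code, so it can specialise the code to that particular type.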

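If the compiler doesn't know the type at which the code will be used, it can only produce polymorphic code. That is bad for performance; I'm rather surprised that it's only a ~2x difference here. Polymorphic code means that many operations need a type-class lookup, and that at the very least makes it impossible to inline the looked-up function or to constant-fold sizes [e.g. for unboxed array/vector access].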

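With the implementation and the use in separate modules, you cannot get performance comparable to the one-module case without making the code that needs specialising visible at the use site (or, if you know the needed types at the implementation site, specialising there, {-# SPECIALISE foo :: Char -> Int, foo :: Bool -> Integer #-} etc.).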
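As a minimal sketch of specialising at the implementation site, assuming the needed element types are known in advance: the module name `PrefixSketch` and the function `commonPrefixLen` are hypothetical and not part of the original code, and only base is used so the module compiles on its own.

```haskell
module PrefixSketch (commonPrefixLen) where

-- Hypothetical helper (not from the original program): length of the
-- common prefix of two lists. The Eq constraint makes it polymorphic, so
-- the generic version receives (==) through a type-class dictionary.
commonPrefixLen :: Eq a => [a] -> [a] -> Int
commonPrefixLen (x:xs) (y:ys)
  | x == y = 1 + commonPrefixLen xs ys
commonPrefixLen _ _ = 0

-- Ask GHC to also compile monomorphic copies in this module. Callers in
-- other modules that use these exact types can then be rewritten to the
-- specialised versions when optimisation is on.
{-# SPECIALISE commonPrefixLen :: String -> String -> Int #-}
{-# SPECIALISE commonPrefixLen :: [Int] -> [Int] -> Int #-}
```

This only helps for the types listed in the pragmas; a caller that uses some other element type still gets the dictionary-passing version unless the unfolding is exposed as well.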

Making the code visible at the use site is usually done by exposing the unfolding in the interface file, by marking the function {-# INLINABLE #-}.

"I tried marking the function with the INLINE pragma, which made no difference to the cross-module performance measurements."

Marking only

lcs :: (U.Unbox a, Eq a) => Vector a -> Vector a -> (Vector Int,Vector Int)
lcs a b | (U.length a > U.length b) = lcsh b a True
        | otherwise = lcsh a b False

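as INLINE or INLINABLE makes no difference, of course: that function is trivial, and the compiler exposes its unfolding anyway because it is so small. Even if its unfolding weren't exposed, the difference would not be measurable.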

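You also need to expose the unfoldings of the functions doing the actual work, at least those of the polymorphic ones: lcsh, findSnakes, gridWalk and cmp (cmp is the one that is crucial here, but the others are necessary to 1. see that cmp is needed, and 2. call the specialised cmp from them).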
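The pragma placement described here can be sketched as follows. This is an illustration under assumptions, not the original code: `LcsSketch`, `matchCount`, `countMatches` and `same` are hypothetical stand-ins for the wrapper, a worker, and the comparison function, and lists replace unboxed vectors so that the module needs only base.

```haskell
module LcsSketch (matchCount) where

-- Exported wrapper (stand-in for a small function like lcs). It is tiny,
-- so marking it alone would change little: the work happens elsewhere.
matchCount :: Eq a => [a] -> [a] -> Int
matchCount xs ys = countMatches 0 xs ys
{-# INLINABLE matchCount #-}

-- Polymorphic worker (stand-in for the functions doing the real work).
-- It counts positions at which both lists hold equal elements. INLINABLE
-- records its unfolding in the interface file so that a calling module
-- can specialise it to a concrete element type.
countMatches :: Eq a => Int -> [a] -> [a] -> Int
countMatches acc (x:xs) (y:ys) =
  let acc' = if same x y then acc + 1 else acc
  in acc' `seq` countMatches acc' xs ys
countMatches acc _ _ = acc
{-# INLINABLE countMatches #-}

-- Comparison helper (stand-in for cmp): the place where the Eq dictionary
-- is actually used, so its unfolding has to be visible as well.
same :: Eq a => a -> a -> Bool
same x y = x == y
{-# INLINABLE same #-}
```

A benchmark module that imports `LcsSketch` and calls `matchCount` on, say, two `String` values then gives GHC the chance to build a `Char`-specialised chain through all three functions, provided it is compiled with optimisation.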

With those made INLINABLE, the difference between the separate-module case

$ ./diffBench 
warming up
estimating clock resolution...
mean is 1.573571 us (320001 iterations)
found 2846 outliers among 319999 samples (0.9%)
  2182 (0.7%) high severe
estimating cost of a clock call...
mean is 40.54233 ns (12 iterations)

benchmarking lcs 10
mean: 1.628523 us, lb 1.618721 us, ub 1.638985 us, ci 0.950
std dev: 51.75533 ns, lb 47.04237 ns, ub 58.45611 ns, ci 0.950
variance introduced by outliers: 26.787%
variance is moderately inflated by outliers

and the single-module case

$ ./oneModule 
warming up
estimating clock resolution...
mean is 1.726459 us (320001 iterations)
found 2092 outliers among 319999 samples (0.7%)
  1608 (0.5%) high severe
estimating cost of a clock call...
mean is 39.98567 ns (14 iterations)

benchmarking lcs 10
mean: 1.523183 us, lb 1.514157 us, ub 1.533071 us, ci 0.950
std dev: 48.48541 ns, lb 44.43230 ns, ub 55.04251 ns, ci 0.950
variance introduced by outliers: 26.791%
variance is moderately inflated by outliers

is small enough to bear.
