list - 在 Haskell 中即时减少列表

Question

假设我有一个f接受一些输入并产生一个数字的函数。在函数f中，根据输入创建一个列表，然后将其减少（例如使用foldl' g）以产生最终输出数。因为中间列表毕竟是要归约的，所以是否可以应用reduce函数g 而不表示中间列表。这里的目标是限制用于存储（或表达，如果“存储”一个不太准确的词）列表的内存。

为了说明这一点，此函数为中间列表foldPairProduct占用O(N1 * N2)空间（由于表达式和惰性求值，消耗的空间可能更复杂，但我认为它是成比例的或更糟）。这N1, N2是两个输入列表的大小。

foldPairProduct :: (Num a, Ord a)  => (a -> a -> a) -> [a] -> [a] -> a
foldPairProduct f xs ys = foldl1 f [ x*y | x <- xs, y <- ys]

该逻辑的另一种实现是 foldPairProduct'，它占用O(2 * 2)空间。

foldPairProduct' :: Num a => (Maybe a -> Maybe a -> Maybe a) -> [a] -> [a] -> Maybe a  
foldPairProduct' _ _ [] = Nothing
foldPairProduct' _ [] _ = Nothing
foldPairProduct' f (x:xs) (y:ys) = 
  foldl1 f [Just $ x*y, foldPairProduct' f [x] ys, foldPairProduct' f xs [y], 
            foldPairProduct' f xs ys]

除了它接受多个列表作为输入之外，它foldCrossProduct的实现类似于它，这种情况更加严重。foldPairProduct中间列表的空间复杂度（仍然是命令式语言的意义上）是O(N1 * N2 * ...* Nk)，其中k是的长度[[a]]。

foldCrossProduct :: Num a => (a -> a -> a) -> [[a]]  -> a
foldCrossProduct f xss = foldl1 f (crossProduct xss)

crossProduct :: Num a => [[a]] -> [a]
crossProduct [] = []
crossProduct (xs:[]) = xs
crossProduct (xs:xss) = [x * y | x <- xs, y <- crossProduct xss]

如果我们遵循的实现思路foldPairProduct'，空间复杂度将为k^2，空间效率更高。我的问题是：

我实现foldPairProduct'了一对列表。但是，似乎为任意数量的列表实现它并不简单。
我的意思不是将 Haskell 与命令式语言进行比较，但是是否有一个使用常量空间的实现（或者换句话说，不表达上述长度的中间列表）？也许 Monad 会有所帮助，但我对它很陌生。
编译器真的发挥了它的魔力吗？也就是说，它注意到列表是中间的并且要减少，并且确实找到了一种节省空间的方法来评估它。毕竟，我认为惰性求值和编译器优化就是为此而设计的。
欢迎任何评论。谢谢。

更新 1

性能测试证实了对和的“空间复杂度”的分析foldPairProduct，foldCrossProduct基于改变输入大小N1, N2, N3，并观察了 GC 复制的字节数。

性能测试证明了foldPairProduct'令人惊讶地显示N1 * N2甚至更差空间使用的分析。这可能是由于递归调用的评估效率低下。结果附在下面（ghc 设置与 Yuras 所做的相同）。

更新 2

当我从评论和答案中学习时，更新了一些进一步的实验。对于foldPairProduct，使用的总内存与 Daniel Fischer 解释的空间复杂度一致。

~~因为foldCrossProduct，尽管 Daniel 的复杂性分析对我来说很有意义，但结果并没有显示出线性的内存使用情况。~~按照丹尼尔的建议，交换了x <- xsand y <- crossproduct ys，它确实实现了线性空间复杂度。

对于foldCrossProduct (max) [[1..100],[1..n], [1..1000]]，当 n = 100、1000、10000、100000 时，使用的内存为 2、2、3、14 MB。

foldPairProduct [1..n] [1..10000]

n = 100
  120,883,320 bytes allocated in the heap 
   56,867,728 bytes copied during GC
      428,384 bytes maximum residency (50 sample(s)) 
       98,664 bytes maximum slop
            3 MB total memory in use (0 MB lost due to fragmentation)     

n = 1000
 1,200,999,280 bytes allocated in the heap 
   569,837,360 bytes copied during GC   
       428,384 bytes maximum residency (500 sample(s))
        99,744 bytes maximum slop 
             3 MB total memory in use (0 MB lost due to fragmentation) 
n = 10000

  12,002,152,040 bytes allocated in the heap
   5,699,468,024 bytes copied during GC 
         428,384 bytes maximum residency (5000 sample(s))
          99,928 bytes maximum slop 
               3 MB total memory in use (0 MB lost due to fragmentation)

n = 100000

 120,013,672,800 bytes allocated in the heap 
  56,997,625,608 bytes copied during GC 
         428,384 bytes maximum residency (50000 sample(s)) 
          99,984 bytes maximum slop 
               3 MB total memory in use (0 MB lost due to fragmentation)

foldPairProduct [1..10000] [1..n]

n = 100

     121,438,536 bytes allocated in the heap 
          55,920 bytes copied during GC     
          32,408 bytes maximum residency (1 sample(s)) 
          19,856 bytes maximum slop  
               1 MB total memory in use (0 MB lost due to fragmentation)

n = 1000

   1,201,511,296 bytes allocated in the heap 
         491,864 bytes copied during GC     
          68,392 bytes maximum residency (1 sample(s)) 
          20,696 bytes maximum slop                   
               1 MB total memory in use (0 MB lost due to fragmentation)

n = 10000

  12,002,232,056 bytes allocated in the heap 
   5,712,004,584 bytes copied during GC     
         428,408 bytes maximum residency (5000 sample(s)) 
          98,688 bytes maximum slop 
               3 MB total memory in use (0 MB lost due to fragmentation)

n = 100000

 120,009,432,816 bytes allocated in the heap
  81,694,557,064 bytes copied during GC 
       4,028,408 bytes maximum residency (10002 sample(s))
         769,720 bytes maximum slop 
              14 MB total memory in use (0 MB lost due to fragmentation)

foldPairProduct [1..n] [1..n]

n = 100
 1,284,024 bytes allocated in the heap
    15,440 bytes copied during GC
    32,336 bytes maximum residency (1 sample(s))
    19,920 bytes maximum slop                  
         1 MB total memory in use (0 MB lost due to fragmentation)  

n = 1000
 120,207,224 bytes allocated in the heap  
     114,848 bytes copied during GC 
      68,336 bytes maximum residency (1 sample(s)) 
      24,832 bytes maximum slop 
           1 MB total memory in use (0 MB lost due to fragmentation)  

n = 10000

  12,001,432,024 bytes allocated in the heap 
   5,708,472,592 bytes copied during GC 
         428,336 bytes maximum residency (5000 sample(s)) 
          99,960 bytes maximum slop 
               3 MB total memory in use (0 MB lost due to fragmentation) 

n = 100000
 1,200,013,672,824 bytes allocated in the heap 
   816,574,713,664 bytes copied during GC 
         4,028,336 bytes maximum residency (100002 sample(s)) 
           770,264 bytes maximum slop 
                14 MB total memory in use (0 MB lost due to fragmentation)

foldCrossProduct (max) [[1..n], [1..100], [1..1000]]

n = 100
     105,131,320 bytes allocated in the heap 
      38,697,432 bytes copied during GC     
         427,832 bytes maximum residency (34 sample(s)) 
         209,312 bytes maximum slop 
               3 MB total memory in use (0 MB lost due to fragmentation)

n = 1000
   1,041,254,480 bytes allocated in the heap 
     374,148,224 bytes copied during GC 
         427,832 bytes maximum residency (334 sample(s))
         211,936 bytes maximum slop 
               3 MB total memory in use (0 MB lost due to fragmentation)

n = 10000
  10,402,479,240 bytes allocated in the heap 
   3,728,429,728 bytes copied during GC     
         427,832 bytes maximum residency (3334 sample(s))
         215,936 bytes maximum slop
               3 MB total memory in use (0 MB lost due to fragmentation)

foldCrossProduct (max) [[1..100], [1..n], [1..1000]]

n = 100
     105,131,344 bytes allocated in the heap 
      38,686,648 bytes copied during GC  
         431,408 bytes maximum residency (34 sample(s)) 
         205,456 bytes maximum slop 
               3 MB total memory in use (0 MB lost due to fragmentation)

n = 1000
   1,050,614,504 bytes allocated in the heap
     412,084,688 bytes copied during GC 
       4,031,456 bytes maximum residency (53 sample(s)) 
       1,403,976 bytes maximum slop
              15 MB total memory in use (0 MB lost due to fragmentation)    
n = 10000
    quit after over 1362 MB total memory in use (0 MB lost due to fragmentation)

foldPairProduct' [1..n] [1..n]

n = 100
 4,351,176 bytes allocated in the heap
    59,432 bytes copied during GC    
    74,296 bytes maximum residency (1 sample(s))
    21,320 bytes maximum slop                  
         1 MB total memory in use (0 MB lost due to fragmentation)

n = 1000
 527,009,960 bytes allocated in the heap 
  45,827,176 bytes copied during GC 
     211,680 bytes maximum residency (1 sample(s)) 
      25,760 bytes maximum slop 
           2 MB total memory in use (0 MB lost due to fragmentation)

score 4 · Accepted Answer

列表的创建/修改/消耗有特定的优化，称为循环融合。因为 Haskell 是纯粹且非严格的，所以有很多规律，例如map f . mag g == map (f . g).

如果编译器由于某种原因无法识别代码并生成次优代码（在传递-O标志之后），我会详细研究流融合，看看是什么阻止了它。

score 4 · Accepted Answer

（好吧，我错了，它不会在恒定空间中工作，因为其中一个列表被多次使用，所以它很可能具有线性空间复杂度）

您是否尝试编译启用优化的测试程序？你foldPairProduct看起来对我很好，我希望它能够在恒定的空间中工作。

添加：是的，它在恒定空间中工作（使用的总内存为 3 MB）：

shum@shum-laptop:/tmp/shum$ cat test.hs 

foldPairProduct f xs ys = foldl1 f [ x*y | x <- xs, y <- ys]

n :: Int
n = 10000

main = print $ foldPairProduct (+) [1..n] [1..n]
shum@shum-laptop:/tmp/shum$ ghc --make -fforce-recomp -O test.hs 
[1 of 1] Compiling Main             ( test.hs, test.o )
Linking test ...
shum@shum-laptop:/tmp/shum$ time ./test +RTS -s
2500500025000000
  10,401,332,232 bytes allocated in the heap
   3,717,333,376 bytes copied during GC
         428,280 bytes maximum residency (3335 sample(s))
         219,792 bytes maximum slop
               3 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     16699 colls,     0 par    4.27s    4.40s     0.0003s    0.0009s
  Gen  1      3335 colls,     0 par    1.52s    1.52s     0.0005s    0.0012s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    2.23s  (  2.17s elapsed)
  GC      time    5.79s  (  5.91s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    8.02s  (  8.08s elapsed)

  %GC     time      72.2%  (73.2% elapsed)

  Alloc rate    4,659,775,665 bytes per MUT second

  Productivity  27.8% of total user, 27.6% of total elapsed


real    0m8.085s
user    0m8.025s
sys 0m0.040s
shum@shum-laptop:/tmp/shum$

score 2 · Accepted Answer

foldPairProduct :: (Num a, Ord a)  => (a -> a -> a) -> [a] -> [a] -> a
foldPairProduct f xs ys = foldl1 f [ x*y | x <- xs, y <- ys]

可以成为一个好的记忆公民。第二个参数ys重复使用，因此在计算过程中必须完全在内存中，但中间列表在消耗时会延迟生成，因此仅贡献恒定数量的内存，从而提供整体O(length ys)空间复杂度。当然，必须有length xs * length ys列表单元产生和消耗，所以总体分配是O(length xs * length ys)[假设每个a值使用有界空间]。通过提供更大的分配区域，可以大大减少在 GC 期间复制的字节数（以及因此 GC 所需的时间），其中+RTS -A1M，数量从

3,717,333,376 bytes copied during GC

默认设置为

20,445,728 bytes copied during GC

以及从GC time 4.88s到和GC time 0.07s的时间。xs == ys = [1 .. 10000] :: [Int]f = (+)

但这取决于严格性分析器的工作——如果它使用的类型是例如Int并且在编译期间已知，并且已知组合函数是严格的，那么它会很好。如果代码不是专门的，或者如果组合函数不知道是严格的，则折叠将产生一个O(length xs * length ys)大小的 thunk。这个问题可以通过使用 stricter 来缓解foldl1'。

foldPairProduct' :: Num a => (Maybe a -> Maybe a -> Maybe a) -> [a] -> [a] -> Maybe a  
foldPairProduct' _ _ [] = Nothing
foldPairProduct' _ [] _ = Nothing
foldPairProduct' f (x:xs) (y:ys) = 
  foldl1 f [Just $ x*y, foldPairProduct' f [x] ys, foldPairProduct' f xs [y], 
            foldPairProduct' f xs ys]

直接遇到了严格性不足的问题，Just这里编译器不能对构造函数包裹的值进行严格处理，因为整体结果可能不需要它，所以折叠经常会O(length xs * length ys)在Just-当然，对于某些f, like const，它会表现得很好。如果所有值都被使用，那么要成为一个良好的记忆公民，您必须使用足够严格的组合函数f，同时强制Just结果中的值（如果它是 a Just）；使用foldl1'也有帮助。这样，它可能具有O(length ys + length xs)空间复杂性（列表xs和ys被多次使用，因此被重用）。

foldCrossProduct :: Num a => (a -> a -> a) -> [[a]]  -> a
foldCrossProduct f xss = foldl1 f (crossProduct xss)

crossProduct :: Num a => [[a]] -> [a]
crossProduct [] = []
crossProduct (xs:[]) = xs
crossProduct (xs:xss) = [x * y | x <- xs, y <- crossProduct xss]

尽管 GHC 几乎不做 CSE（公共子表达式消除），但该列表crossProduct xss将（可能）在此处在不同x的 s 之间共享，因此会产生O(N2*...*Nk)空间复杂度。如果列表中元素的顺序无关紧要，则重新排序为

crossProduct (xs:xss) = [x * y | y <- crossProduct xss, x <- xs]

有帮助。然后crossProduct xss不需要一次在内存中，因此可以增量生产和消费，只xs需要记住，因为它被多次使用。对于递归调用，必须共享剩余列表中的第一个，这样会产生整体O(N1+...+Nk-1)空间复杂度。

list - 在 Haskell 中即时减少列表

3 回答 3

Related

Reference