
The following F# code crashes on the third call with an out of memory exception. Either I'm missing something or Alea is not releasing memory properly for some reason. I tried it both in F# Interactive and compiled. I also tried calling dispose manually, but it didn't help. Any idea why?

let squareGPU (inputs:float[]) =
        use dInputs = worker.Malloc(inputs)
        use dOutputs = worker.Malloc(inputs.Length)
        let blockSize = 256
        let numSm = worker.Device.Attributes.MULTIPROCESSOR_COUNT
        let gridSize = Math.Min(16 * numSm, divup inputs.Length blockSize)
        let lp = new LaunchParam(gridSize, blockSize)
        worker.Launch <@ squareKernel @> lp dOutputs.Ptr dInputs.Ptr inputs.Length
        dOutputs.Gather()


let x = squareGPU [|0.0..0.001..100000.0|]
printfn "1" 
let y = squareGPU [|0.0..0.001..100000.0|]
printfn "2" 
let z = squareGPU [|0.0..0.001..100000.0|]
printfn "3"

2 Answers


I guess you got a System.OutOfMemoryException, right? That does not mean the GPU device ran out of memory; it means your host memory is running out. In your example you create a fairly large array on the host, compute on it, and then gather another large array back as output. The point is that you store the output arrays in different variables (x, y and z), so the GC never has a chance to free them, and eventually you exhaust your host memory.

I did a very simple test (using a stop value of 30000 instead of the 100000 in your example) which uses only host code, no GPU code at all:

let x1 = [|0.0..0.001..30000.0|]
printfn "1" 
let x2 = [|0.0..0.001..30000.0|]
printfn "2" 
let x3 = [|0.0..0.001..30000.0|]
printfn "3"
let x4 = [|0.0..0.001..30000.0|]
printfn "4"
let x5 = [|0.0..0.001..30000.0|]
printfn "5"
let x6 = [|0.0..0.001..30000.0|]
printfn "6"

I ran this code in F# Interactive (which is a 32-bit process) and got this:

Microsoft (R) F# Interactive version 12.0.30815.0
Copyright (c) Microsoft Corporation. All Rights Reserved.

For help type #help;;

> 
1
2
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.Collections.Generic.List`1.set_Capacity(Int32 value)
   at System.Collections.Generic.List`1.EnsureCapacity(Int32 min)
   at System.Collections.Generic.List`1.Add(T item)
   at Microsoft.FSharp.Collections.SeqModule.ToArray[T](IEnumerable`1 source)
   at <StartupCode$FSI_0002>.$FSI_0002.main@() in C:\Users\Xiang\Documents\Inbox\ConsoleApplication6\Script1.fsx:line 32
Stopped due to error
> 

This means that after I had created two such large arrays (x1 and x2), my host memory was exhausted.
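
As a rough back-of-the-envelope check (just a sketch, assuming 8 bytes per float and ignoring the intermediate List that Seq.toArray grows while building the array, which the stack trace above shows is also involved):

let elements  = 30000.0 / 0.001 + 1.0
let megabytes = elements * 8.0 / 1024.0 / 1024.0
// roughly 30,000,000 floats ~ 240 MB per array, so a couple of these,
// plus the temporary buffers needed to build them, quickly overwhelm
// a 32-bit process
printfn "approx. %.0f MB per array" megabytes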

To confirm this further, I reused the same variable name, which gives the GC a chance to collect the old arrays, and this time it worked:

let foo() =
    let x = [|0.0..0.001..30000.0|]
    printfn "1" 
    let x = [|0.0..0.001..30000.0|]
    printfn "2" 
    let x = [|0.0..0.001..30000.0|]
    printfn "3"
    let x = [|0.0..0.001..30000.0|]
    printfn "4"
    let x = [|0.0..0.001..30000.0|]
    printfn "5"
    let x = [|0.0..0.001..30000.0|]
    printfn "6"

> 

val foo : unit -> unit

> foo()
;;
1
2
3
4
5
6
val it : unit = ()
> 

If I add the GPU kernel, it still works:

let foo() =
    let x = squareGPU [|0.0..0.001..30000.0|]
    printfn "1" 
    let x = squareGPU [|0.0..0.001..30000.0|]
    printfn "2" 
    let x = squareGPU [|0.0..0.001..30000.0|]
    printfn "3"
    let x = squareGPU [|0.0..0.001..30000.0|]
    printfn "4"
    let x = squareGPU [|0.0..0.001..30000.0|]
    printfn "5"
    let x = squareGPU [|0.0..0.001..30000.0|]
    printfn "6"
    let x = squareGPU [|0.0..0.001..30000.0|]
    printfn "7"
    let x = squareGPU [|0.0..0.001..30000.0|]
    printfn "8"

> foo();;
1
2
3
4
5
6
7
8
val it : unit = ()
> 

Alternatively, you can try using a 64-bit process.
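
A quick way to check which kind of process you are running in (just a sanity-check sketch using the standard System.Environment API):

// prints false in the default 32-bit F# Interactive, true in a 64-bit process
printfn "64-bit process: %b" System.Environment.Is64BitProcess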

Answered 2015-08-01T08:56:57.100

The GC runs on a separate background thread, so if you frequently allocate huge new arrays, it can easily throw an out-of-memory exception.

For arrays this large I suggest you use an in-place update approach, which is more stable. I created a test to show this. (Note: since the arrays are very large, you should go to the project properties page and, in the Build tab, uncheck "Prefer 32-bit" to make sure it runs as a 64-bit process.)

open System
open Alea.CUDA
open Alea.CUDA.Utilities
open NUnit.Framework

[<ReflectedDefinition>]
let squareKernel (outputs:deviceptr<float>) (inputs:deviceptr<float>) (n:int) =
    let start = blockIdx.x * blockDim.x + threadIdx.x
    let stride = gridDim.x * blockDim.x
    let mutable i = start 
    while i < n do
        outputs.[i] <- inputs.[i] * inputs.[i]
        i <- i + stride

let squareGPUInplaceUpdate (worker:Worker) (lp:LaunchParam) (hData:float[]) (dData:DeviceMemory<float>) =
    // Instead of mallocing new device memory, you just reuse the existing device memory dData
    // and scatter the new data into it.
    dData.Scatter(hData)
    worker.Launch <@ squareKernel @> lp dData.Ptr dData.Ptr hData.Length
    // Ideally there would be a counterpart to dData.Scatter(hData), such as dData.Gather(hData),
    // but unfortunately that is missing; worker.Gather is a workaround.
    worker.Gather(dData.Ptr, hData)

let squareGPUManyTimes (iters:int) =
    let worker = Worker.Default

    // During all these iterations you only malloc two host arrays (for the data and the expected values)
    // and one device array, and you keep reusing them, since they are big arrays.
    // If you allocate huge new arrays very frequently, the GC comes under pressure, and since the GC
    // works on a separate thread, you will get a System.OutOfMemoryException from time to time.
    let hData = [|0.0..0.001..100000.0|]
    let n = hData.Length
    let expected = Array.zeroCreate n
    use dData = worker.Malloc<float>(n)

    let rng = Random()
    let update () =
        // in-place updating the data
        for i = 0 to n - 1 do
            hData.[i] <- rng.NextDouble()
            expected.[i] <- hData.[i] * hData.[i]

    let lp =
        let blockSize = 256
        let numSm = worker.Device.Attributes.MULTIPROCESSOR_COUNT
        let gridSize = Math.Min(16 * numSm, divup n blockSize)
        new LaunchParam(gridSize, blockSize)

    for i = 1 to iters do
        update()
        squareGPUInplaceUpdate worker lp hData dData
        Assert.AreEqual(expected, hData)
        printfn "iter %d passed..." i

[<Test>]
let test() =
    squareGPUManyTimes 5

Note that a System.OutOfMemoryException always refers to host memory; if the GPU runs out of memory, a CUDAException is thrown instead.

By the way, each time you call DeviceMemory.Gather() it creates a new .NET array and fills it. With the in-place approach shown in this example, you provide a .NET array yourself and let it be filled with data from the device.
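
To make the difference concrete, here is a small sketch reusing the dData, hData and worker names from the test above:

// DeviceMemory<'T>.Gather() allocates and returns a fresh .NET array on every call
let fresh : float[] = dData.Gather()

// worker.Gather fills an existing host array in place, so no new array is allocated
worker.Gather(dData.Ptr, hData)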

Answered 2015-08-01T14:14:24.027