memory-management - cudaMemGetInfo not constant?

Question

I am testing dynamic allocation inside a kernel, i.e.

__device__ double *temp;
__global__
void test(){
    temp = new double[125000]; //1MB
}

I call this kernel 100 times to see whether the available memory decreases:

size_t free, total;
CUDA_CHECK(cudaMemGetInfo(&free, &total));  
fprintf(stdout,"\t### Available VRAM : %g Mo/ %g Mo(total)\n\n", free/pow(10., 6), total/pow(10., 6)); 

for(int t=0;t<100;t++){
        test<<<1, 1>>>();
        CUDA_CHECK(cudaDeviceSynchronize());  
        fprintf(stdout,"\t### Available VRAM : %g Mo/ %g Mo(total)\n\n", free/pow(10., 6), total/pow(10., 6));
    }
CUDA_CHECK(cudaMemGetInfo(&free, &total));
fprintf(stdout,"\t### Available VRAM : %g Mo/ %g Mo(total)\n\n", free/pow(10., 6), total/pow(10., 6));

And it actually does.

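Note: when I tried it without calling the kernel and cudaMemGetInfo inside the loop, the available memory went down from 800 to 650 Mo (MB), and I concluded that the output to the console takes about 150 Mo. Indeed, with the code as written above, the result does not change. But that is huge!
After the loop, my available memory has dropped by about 50 Mo (as I would hope, there is no drop at all when I comment out the kernel call). When I add a delete(temp) inside the kernel, it does not seem to reduce the wasted memory by much; I still see a drop of about 30 Mo. Why?
Calling cudaFree(temp) or cudaDeviceReset() after the loop does not help much either. Why? And how do I free the memory?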

score 3 · Accepted Answer

It sounds like you really need to read this question and answer pair before you go much further.

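The memory you allocate with new inside the kernel comes from a static runtime heap. That heap is allocated as part of the "lazy" context establishment which the CUDA runtime initiates when your program runs. The first CUDA call that establishes a context also loads the module containing the kernel code and reserves local memory, runtime buffers, and the runtime heap for the kernel calls that follow. That is where most of the memory consumption you have observed comes from. The runtime API contains a call which lets the user control the size of these allocations.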
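The call in question is, as far as I know, cudaDeviceSetLimit with cudaLimitMallocHeapSize. Here is a minimal sketch of querying and changing the heap size. The check and to_mo helpers and the 128 MiB figure are my own choices for illustration, not something from the question. The CUDA programming guide documents a default heap of 8 MB, and the limit has to be set before the first kernel that uses new or malloc is launched.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a runtime call fails.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Convert bytes to decimal megabytes ("Mo"), the unit used in the question.
static double to_mo(size_t bytes) { return bytes / 1e6; }

int main() {
    size_t heap = 0;

    // Size of the heap that in-kernel new/malloc allocate from.
    check(cudaDeviceGetLimit(&heap, cudaLimitMallocHeapSize),
          "cudaDeviceGetLimit");
    printf("default device heap   : %g Mo\n", to_mo(heap));

    // Assumed value for illustration: request a 128 MiB heap.
    // This must happen before any kernel using new/malloc is launched.
    size_t requested = 128 * 1024 * 1024;
    check(cudaDeviceSetLimit(cudaLimitMallocHeapSize, requested),
          "cudaDeviceSetLimit");

    // Read the limit back; the runtime reports the value actually in effect.
    check(cudaDeviceGetLimit(&heap, cudaLimitMallocHeapSize),
          "cudaDeviceGetLimit");
    printf("requested device heap : %g Mo\n", to_mo(heap));
    return 0;
}
```

Whatever size is in effect is reserved up front, so it shows up in cudaMemGetInfo as used memory whether or not a kernel ever touches it.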

You should find that doing something like this on CUDA version 4 or 5:

size_t free, total;
CUDA_CHECK(cudaMemGetInfo(&free, &total));  
fprintf(stdout,"\t### Available VRAM : %g Mo/ %g Mo(total)\n\n", 
                    free/1e6, total/1e6); 

cudaFree(0);

CUDA_CHECK(cudaMemGetInfo(&free, &total));  
fprintf(stdout,"\t### Available VRAM : %g Mo/ %g Mo(total)\n\n", 
                    free/1e6, total/1e6); 

// Kernel loop follows

[disclaimer: written in browser, use at own risk]

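should show that the available memory drops after the cudaFree(0) call, because that call should trigger the context initialisation sequence, which consumes memory on the GPU.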
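For anyone who wants to try this end to end, here is a self-contained sketch that reports free memory at each step. The CUDA_CHECK definition, the report and to_mo helpers, and the free_test kernel are my own assumptions (the question does not show its CUDA_CHECK). As far as I understand it, memory obtained with in-kernel new lives on the device heap and is released with an in-kernel delete[], not with cudaFree from the host. The numbers you see will depend on the GPU, the driver, and the CUDA version.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Assumed definition; the question's own CUDA_CHECK is not shown.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e_));                          \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__device__ double *temp;

__global__ void alloc_test() {
    temp = new double[125000];  // 1 MB taken from the device runtime heap
}

__global__ void free_test() {
    delete[] temp;              // device-side new[] pairs with delete[]
    temp = nullptr;
}

// Convert bytes to decimal megabytes ("Mo"), the unit used in the question.
static double to_mo(size_t bytes) { return bytes / 1e6; }

// Query and print the free/total device memory with a label.
static void report(const char *label) {
    size_t free_b = 0, total_b = 0;
    CUDA_CHECK(cudaMemGetInfo(&free_b, &total_b));
    printf("%-22s %g Mo free / %g Mo total\n",
           label, to_mo(free_b), to_mo(total_b));
}

int main() {
    report("first runtime call:");

    CUDA_CHECK(cudaFree(0));
    report("after cudaFree(0):");

    // Allocate and release 1 MB on the device heap, 100 times.
    for (int t = 0; t < 100; ++t) {
        alloc_test<<<1, 1>>>();
        free_test<<<1, 1>>>();
        CUDA_CHECK(cudaGetLastError());
        CUDA_CHECK(cudaDeviceSynchronize());
    }
    report("after kernel loop:");
    return 0;
}
```

Unlike the loop in the question, report calls cudaMemGetInfo every time it prints, so each line reflects the state at that point rather than a stale value.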
