cuda - Difference in creating a CUDA context

Question

I have a program that uses three kernels. To get the speedups, I was doing a dummy memory copy to create a context, as follows:

__global__ void warmStart(int* f)
{
    *f = 0;
}

It is launched before the kernels I want to time, as follows:

int *dFlag = NULL;
cudaMalloc( (void**)&dFlag, sizeof(int) );
warmStart<<<1, 1>>>(dFlag);
Check_CUDA_Error("warmStart kernel");

I have also read that the simplest ways to create a context are cudaFree(0) or cudaDeviceSynchronize(). But using these API calls gives much worse times than using the dummy kernel.

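After forcing the context, the execution time of the program is 0.000031 seconds with the dummy kernel and 0.000064 seconds with either cudaDeviceSynchronize() or cudaFree(0). The times are the mean of 10 individual executions of the program.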

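Therefore, my conclusion is that launching a kernel initializes something that is not initialized when the context is created in the canonical way.

So, what is the difference between creating a context in these two ways, using a kernel and using an API call?

I ran the test on a GTX480, using CUDA 4.0 under Linux.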

score 3 · Accepted Answer

Each CUDA context has memory allocations that are required to execute a kernel but are not required to synchronize, allocate memory, or free memory. The initial allocation of this context memory, and any later resizing of it, is deferred until a kernel requires these resources. Examples of these allocations include the local memory buffer, the device heap, and the printf heap. A dummy kernel launch therefore pays for this setup up front, whereas cudaFree(0) and cudaDeviceSynchronize() create the context but leave that work to the first real kernel launch.
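To see the effect for yourself, something along these lines may help. It is only a minimal sketch, not a rigorous benchmark: it creates the context with cudaFree(0), optionally adds the dummy kernel from the question, and then times the first "real" kernel with CUDA events. The `work` kernel, the `check` helper, and `timeFirstKernel` are hypothetical names made up for this illustration; they stand in for the asker's three kernels and `Check_CUDA_Error`. The numbers you get will depend on the GPU, driver, and CUDA version.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Dummy kernel from the question: one store so the launch is not empty.
__global__ void warmStart(int *f)
{
    *f = 0;
}

// Hypothetical stand-in for the kernel that is actually being timed.
__global__ void work(int *f)
{
    *f = 1;
}

// Hypothetical helper: abort with a message if a CUDA runtime call failed.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Returns the elapsed time in milliseconds of the first `work` launch.
// The context is always created with cudaFree(0); if useKernelWarmup is
// true, the dummy kernel is launched as well before timing starts.
float timeFirstKernel(bool useKernelWarmup)
{
    check(cudaFree(0), "cudaFree(0)");

    int *dFlag = NULL;
    check(cudaMalloc((void **)&dFlag, sizeof(int)), "cudaMalloc");

    if (useKernelWarmup) {
        warmStart<<<1, 1>>>(dFlag);
        check(cudaGetLastError(), "warmStart launch");
        check(cudaDeviceSynchronize(), "warmStart sync");
    }

    cudaEvent_t start, stop;
    check(cudaEventCreate(&start), "cudaEventCreate(start)");
    check(cudaEventCreate(&stop), "cudaEventCreate(stop)");

    check(cudaEventRecord(start, 0), "cudaEventRecord(start)");
    work<<<1, 1>>>(dFlag);
    check(cudaGetLastError(), "work launch");
    check(cudaEventRecord(stop, 0), "cudaEventRecord(stop)");
    check(cudaEventSynchronize(stop), "cudaEventSynchronize(stop)");

    float ms = 0.0f;
    check(cudaEventElapsedTime(&ms, start, stop), "cudaEventElapsedTime");

    check(cudaEventDestroy(start), "cudaEventDestroy(start)");
    check(cudaEventDestroy(stop), "cudaEventDestroy(stop)");
    check(cudaFree(dFlag), "cudaFree(dFlag)");
    return ms;
}

int main(int argc, char **argv)
{
    // Run with any extra argument to add the dummy-kernel warm-up.
    const bool useKernelWarmup = (argc > 1);
    const float ms = timeFirstKernel(useKernelWarmup);
    printf("warm-up: %s, first kernel: %f ms\n",
           useKernelWarmup ? "cudaFree(0) + dummy kernel" : "cudaFree(0) only",
           ms);
    return 0;
}
```

Run the binary once without arguments and once with any argument, each in a fresh process, so that the second measurement does not benefit from state the first one already set up. Averaging over several runs, as in the question, gives more stable numbers.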
