
I have a multi-threaded CPU application, and I would like each CPU thread to be able to launch a separate CUDA stream. The individual CPU threads will be doing different things at different times, so there is a chance their work won't overlap, but if they do launch CUDA kernels at the same time, I would like those kernels to run concurrently.

I'm fairly sure this is possible, because section 3.2.5.5 of the CUDA Toolkit documentation says "A stream is a sequence of commands (possibly issued by different host threads)..."

So if I wanted to implement this, I would do something like:

void main(int CPU_ThreadID) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int *d_a;
    int *a;
    cudaMalloc((void**)&d_a, 100*sizeof(int));
    cudaMallocHost((void**)&a, 100*8*sizeof(int));
    // copy this thread's slice of the pinned host buffer asynchronously
    cudaMemcpyAsync(d_a, &a[100*CPU_ThreadID], 100*sizeof(int), cudaMemcpyHostToDevice, stream);
    sum<<<100,32,0,stream>>>(d_a);

    cudaStreamDestroy(stream);
}

This is just a simple example. If I know there are only 8 CPU threads, then I know at most 8 streams will be created. Is this the right way to do it? And will this run concurrently if two or more different host threads reach this code at about the same time? Thanks for your help!

Edit:

I corrected some syntax issues in the code block and switched to cudaMemcpyAsync as sgar91 suggested.


1 Answer


It really looks to me like you are proposing a multi-process application, not a multithreaded one. You don't mention which threading architecture you have in mind, or even an OS, but the threading architectures I know of don't posit a thread routine called "main", and you haven't shown any preamble to the thread code.

A multi-process environment will generally create one device context per process, which will inhibit fine-grained concurrency.

Even if that's just an oversight, I would point out that a multi-threaded application should establish a GPU context on the desired device before threads are spawned.

Each thread can then issue a cudaSetDevice(0); or similar call, which should cause each thread to pick up the established context on the indicated device.

Once that is in place, you should be able to issue commands to the desired streams from whichever threads you like.
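As a rough sketch of that structure (assuming Linux with pthreads, a single GPU at device 0, and a placeholder `sum` kernel; the buffer sizes follow the question's example, and error checking is omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <pthread.h>

#define N 100
#define NUM_THREADS 8

__global__ void sum(int *d_a) {
    // placeholder kernel body
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) d_a[i] += 1;
}

int *h_a;  // pinned host buffer shared by all threads

void *worker(void *arg) {
    int tid = (int)(size_t)arg;
    cudaSetDevice(0);  // pick up the context established by the main thread

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int *d_a;
    cudaMalloc((void **)&d_a, N * sizeof(int));
    cudaMemcpyAsync(d_a, h_a + N * tid, N * sizeof(int),
                    cudaMemcpyHostToDevice, stream);
    sum<<<100, 32, 0, stream>>>(d_a);
    cudaStreamSynchronize(stream);  // wait for this stream's work before freeing

    cudaFree(d_a);
    cudaStreamDestroy(stream);
    return NULL;
}

int main(void) {
    cudaSetDevice(0);  // establish the device context before spawning threads
    cudaMallocHost((void **)&h_a, N * NUM_THREADS * sizeof(int));

    pthread_t t[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)(size_t)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);

    cudaFreeHost(h_a);
    return 0;
}
```

Note that pinned allocation happens once in the main thread, and each worker only creates its own stream and issues its own copies and launches into it.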

You may wish to refer to the cudaOpenMP sample code. Although it omits streams, it demonstrates a multi-threaded app with the potential for multiple threads to issue commands to the same device (and it could be extended to the same stream).

Whether kernels actually run concurrently after the above issues have been addressed is a separate question. Concurrent kernel execution has a number of requirements, and the kernels themselves must have compatible resource requirements (blocks, shared memory, registers, etc.), which generally implies "small" kernels.
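One quick sanity check is the concurrentKernels field of the device properties, which reports whether the device can run kernels from different streams at the same time at all (a minimal sketch using the runtime API, querying device 0):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // 1 if the device can execute multiple kernels concurrently, 0 otherwise
    printf("Concurrent kernels supported: %d\n", prop.concurrentKernels);
    return 0;
}
```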

answered 2013-10-15T19:27:19.240