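I'm using OpenCL's enqueue_kernel() function to enqueue kernels dynamically from the GPU, to cut down on unnecessary host interaction. Here is a simplified example of what I'm trying to do in my kernels: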
kernel void kernelA(args)
{
    // This kernel is the one that is enqueued from the host, with only one work item. This kernel
    // could be considered the "master" kernel that controls the logic of when to enqueue tasks.
    // First, it checks if a condition is met, then it enqueues kernelB.
    if (some condition)
    {
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelB(args);});
    }
    else
    {
        // do other things
    }
}

kernel void kernelB(args)
{
    // Do some stuff.
    // Only enqueue the next kernel with the first work item. I do this because the things
    // occurring in kernelC rely on the things that kernelB does, so it must take place after
    // kernelB is completed, hence the CLK_ENQUEUE_FLAGS_WAIT_KERNEL.
    if (get_global_id(0) == 0)
    {
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelC(args);});
    }
}

kernel void kernelC(args)
{
    // Do some stuff. This one in particular is one step in a sorting algorithm.
    // This kernel will enqueue kernelD if a condition is met, otherwise it will
    // return to kernelA.
    if (get_global_id(0) == 0 && other requirements)
    {
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelD(args);});
    }
    else if (get_global_id(0) == 0)
    {
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelA(args);});
    }
}

kernel void kernelD(args)
{
    // Do some stuff.
    // Finally, if some condition is met, enqueue kernelC again. What this will do is
    // bounce back and forth between kernelC and kernelD until the condition is
    // no longer met. If it isn't met, go back to kernelA.
    if (some condition)
    {
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelC(args);});
    }
    else
    {
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelA(args);});
    }
}
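That's the general flow of the program. It works perfectly and does exactly what I intended, in the exact order I intended, except for one problem. In certain cases, when the workload is very high, a random one of the enqueue_kernel() calls fails to enqueue and the program halts. This happens because the device queue is full and no further task can fit into it. But I can't for the life of me figure out why that is, even after extensive research.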
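For reference, a failed device-side enqueue can be detected from the return value of enqueue_kernel(). The following is only a minimal sketch, assuming OpenCL C 2.0 device code built with -cl-std=CL2.0; the kernel names `child` and `parent` and the `data` and `status` buffers are hypothetical and not part of the program above. Per the OpenCL C 2.0 specification, enqueue_kernel() returns CLK_SUCCESS on success and CLK_ENQUEUE_FAILURE otherwise; a more specific code such as CLK_DEVICE_QUEUE_FULL may be returned instead when the program is built with -g.

```c
// OpenCL C 2.0 device code (not host C). All names here are hypothetical.
kernel void child(global int *data)
{
    // Placeholder work for the enqueued child kernel.
    data[get_global_id(0)] += 1;
}

kernel void parent(global int *data, global int *status)
{
    // Enqueue the child from the device and keep the return code.
    int err = enqueue_kernel(get_default_queue(),
                             CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
                             ndrange_1D(256, 256),
                             ^{ child(data); });

    // Store the code so the host can read it back after the run.
    // CLK_SUCCESS means the child was enqueued; anything else is a failure.
    status[0] = err;
}
```

Reading `status[0]` back on the host after the run shows whether that particular enqueue was accepted. Whether a specific code like CLK_DEVICE_QUEUE_FULL is actually reported depends on the build options and the implementation.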
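I thought that once a task in the queue (a kernel, for instance) finishes, it frees up its spot in the queue. So my queue should only ever hold about 1 or 2 tasks at a time. But this program literally fills up the entire 262,144-byte device command queue and stops functioning.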
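If anyone has any ideas, I would greatly appreciate some insight into why this is happening. I'm kind of stuck and can't continue until I get past this issue.
Thank you in advance!
(BTW, I'm running on a Radeon RX 590 card, using the AMD APP SDK 3.0 with OpenCL 2.0.)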