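I have an OpenCL kernel that I want to run on every OpenCL-capable device detected on a system (for example, all available GPUs), and this should work across different systems. Is there a simple way to do this? What I have in mind is creating a single command queue for all devices.
Thanks in advance :]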
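You cannot create a single command queue for all devices; a given command queue is bound to a single device. However, you can create a separate command queue for each OpenCL device and feed work to each of them, and that work should execute concurrently.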
As Dithermaster points out, you first create a separate command queue for each device; for instance, you might have multiple GPUs. You can then place these queues in an array, e.g., here is a pointer to an array that you can set up:
cl_command_queue* commandQueues;
However, in my experience, getting the various command queues to execute concurrently has not always been a "slam dunk". You can verify it with event timing information (checking for overlap), obtained either through your own profiling or with third-party profiling tools. You should do this step anyway to verify what does or does not work on your setup.
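For the overlap check, once you have start and end timestamps for one command on each queue, the comparison itself is plain interval arithmetic. Here is a minimal sketch, assuming you have already read the timestamps with clGetEventProfilingInfo (CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END) from queues created with CL_QUEUE_PROFILING_ENABLE, and assuming the timestamps of the two devices are comparable, which may not hold on every setup. CommandSpan and overlap_ns are illustrative names, not part of OpenCL:

```cpp
#include <algorithm>
#include <cstdint>

// Start and end of one command in nanoseconds, as you would read them from
// CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END (illustrative type).
struct CommandSpan {
    std::uint64_t start_ns;
    std::uint64_t end_ns;
};

// Nanoseconds during which both commands were executing at the same time.
// A result of 0 means the two commands did not overlap at all.
std::uint64_t overlap_ns(const CommandSpan& a, const CommandSpan& b) {
    const std::uint64_t latest_start = std::max(a.start_ns, b.start_ns);
    const std::uint64_t earliest_end = std::min(a.end_ns, b.end_ns);
    return earliest_end > latest_start ? earliest_end - latest_start : 0;
}
```

If the kernels on two queues report an overlap of 0, they were effectively serialized, whatever the host code looked like.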
An alternative approach that can work quite nicely is to use OpenMP to drive the command queues concurrently from separate host threads, e.g., something like:
#pragma omp parallel for default(shared)
for (int i = 0; i < numDevices; ++i) {
    someOpenCLFunction(commandQueues[i], ....);
}
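What each iteration of that loop submits is up to you. One simple option, which is an assumption here rather than something the loop above prescribes, is to split the jobs statically so that each device gets one contiguous range. Below is a sketch in plain C++, where submitRange is a hypothetical stand-in for the OpenCL calls you would make on commandQueues[i]; JobRange, splitJobs and runAll are illustrative names as well:

```cpp
#include <cstddef>
#include <vector>

// Half-open range [begin, end) of job indices assigned to one device.
struct JobRange {
    std::size_t begin;
    std::size_t end;
};

// Split numJobs into numDevices contiguous ranges whose sizes differ by at most one.
std::vector<JobRange> splitJobs(std::size_t numJobs, std::size_t numDevices) {
    std::vector<JobRange> ranges(numDevices);
    if (numDevices == 0) return ranges;
    const std::size_t base = numJobs / numDevices;
    const std::size_t extra = numJobs % numDevices;  // the first 'extra' devices get one more job
    std::size_t next = 0;
    for (std::size_t i = 0; i < numDevices; ++i) {
        const std::size_t count = base + (i < extra ? 1 : 0);
        ranges[i] = JobRange{next, next + count};
        next += count;
    }
    return ranges;
}

// Hypothetical stand-in for the per-device OpenCL work: it only records
// which device handled each job.
void submitRange(int device, JobRange range, std::vector<int>& owner) {
    for (std::size_t j = range.begin; j < range.end; ++j) owner[j] = device;
}

// One loop iteration per device; the iterations run in parallel when OpenMP is enabled.
std::vector<int> runAll(std::size_t numJobs, int numDevices) {
    std::vector<int> owner(numJobs, -1);
    if (numDevices <= 0) return owner;
    const std::vector<JobRange> ranges =
        splitJobs(numJobs, static_cast<std::size_t>(numDevices));
#ifdef _OPENMP
#pragma omp parallel for default(shared)
#endif
    for (int i = 0; i < numDevices; ++i) {
        submitRange(i, ranges[i], owner);  // each device writes only to its own slots
    }
    return owner;
}
```

A static split like this is easy to reason about, but it only balances well when the devices are about equally fast.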
Suppose you have N devices and 100 work elements (jobs). What you should do is something like this:
#define SIZE 3
// 'context' is assumed to be the cl::Context shared by all the devices
std::vector<cl::CommandQueue> queues(SIZE); // One queue for each device (same context)
std::vector<cl::Kernel> kernels(SIZE); // One kernel for each device (same context)
std::vector<cl::Buffer> buf_in(SIZE), buf_out(SIZE); // One buffer set for each device (same context)

// Initialize the queues, kernels, buffers etc....
// Create the kernels, buffers and queues, then set the kernels[0] args to point to buf_in[0] and buf_out[0], and so on...

// Create the events in a finished state
std::vector<cl::Event> events;
cl::UserEvent ev(context); ev.setStatus(CL_COMPLETE);
for (size_t i = 0; i < queues.size(); i++)
    events.push_back(ev);

// Run all the elements (a "first empty, first run" scheduler)
for (size_t i = 0; i < jobs.size(); i++) {
    bool found = false;
    int x = -1;
    // Try all the queues
    while (!found) {
        for (size_t j = 0; j < queues.size(); j++)
            if (events[j].getInfo<CL_EVENT_COMMAND_EXECUTION_STATUS>() == CL_COMPLETE) {
                found = true;
                x = static_cast<int>(j);
                break;
            }
        if (!found) Sleep(50); // No queue is free yet: sleep a while (Sleep is the Windows call). Other options are possible, like assigning the job to a random queue
    }
    // Run it
    events[x] = cl::Event(); // Clean it
    queues[x].enqueueWriteBuffer(...); // Copy buf_in
    queues[x].enqueueNDRangeKernel(kernels[x], .... ); // Launch the kernel
    queues[x].enqueueReadBuffer(... , &events[x]); // Read buf_out; events[x] completes when the read finishes
}

// Wait for completion
for (size_t i = 0; i < queues.size(); i++)
    queues[i].finish();
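To get a feel for how this polling scheduler spreads jobs over devices of different speeds, you can try the same loop on paper first. The sketch below is a simulation only and makes no OpenCL calls: costMs[q] is a made-up time that queue q needs per job, pollMs plays the role of Sleep(50), and the check busyUntil[j] <= now stands in for the event status being CL_COMPLETE. Schedule and simulate are illustrative names:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Result of the simulation: which queue ran each job, and the simulated time
// at which the last queue finished.
struct Schedule {
    std::vector<int> queueOfJob;
    long finishMs;
};

// Simulate the "first empty, first run" loop with a fake clock.
Schedule simulate(std::size_t numJobs, const std::vector<long>& costMs, long pollMs) {
    assert(!costMs.empty() && pollMs > 0);
    std::vector<long> busyUntil(costMs.size(), 0);  // every queue starts idle
    Schedule s{std::vector<int>(numJobs, -1), 0};
    long now = 0;
    for (std::size_t i = 0; i < numJobs; ++i) {
        int x = -1;
        while (x < 0) {
            // Try all the queues in order and take the first idle one
            for (std::size_t j = 0; j < busyUntil.size(); ++j) {
                if (busyUntil[j] <= now) { x = static_cast<int>(j); break; }
            }
            if (x < 0) now += pollMs;  // nothing idle: "sleep" and poll again
        }
        busyUntil[x] = now + costMs[x];  // queue x is busy until this job is done
        s.queueOfJob[i] = x;
    }
    for (long t : busyUntil) s.finishMs = std::max(s.finishMs, t);
    return s;
}
```

For example, with 6 jobs, per-job costs of 100 ms and 300 ms, and a 50 ms poll, the faster queue ends up with 4 of the 6 jobs, which is the load-balancing behaviour you want from this kind of scheduler.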