
When I started programming in OpenCL I used the following approach for providing data to my kernels:

cl_mem buff = clCreateBuffer(cl_ctx, CL_MEM_READ_WRITE, object_size, NULL, NULL);
clEnqueueWriteBuffer(cl_queue, buff, CL_TRUE, 0, object_size, (void *) object, 0, NULL, NULL);

This obviously required me to partition my data into chunks, ensuring that each chunk would fit into the device memory. After performing the computations, I'd read the data back with clEnqueueReadBuffer(). However, at some point I realised I could just use the following line:

cl_mem buff = clCreateBuffer(cl_ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, object_size, (void*) object, NULL);

When doing this, partitioning the data became unnecessary. And to my surprise, I saw a great boost in performance. That is something I don't understand. From what I understand, when using a host pointer, the device memory works as a cache, but all the data still needs to be copied to it for processing and then copied back to main memory once finished. How come using an explicit copy (clEnqueueRead/WriteBuffer) is an order of magnitude slower, when in my mind it should be basically the same? Am I missing something?

Thanks.


2 Answers


Yes, what you're missing is the CL_TRUE in the clEnqueueWriteBuffer call. It makes the write blocking, stalling the CPU while the copy takes place. With a host pointer, the OpenCL implementation can "optimize" the copy by making it asynchronous, hence the better overall performance.

Note that this is implementation-dependent; there is no guarantee it will be faster, equal, or slower.

Answered 2010-08-09T19:50:04.003

In some cases the CPU and the GPU can share the same physical DRAM. For example, if the memory block satisfies both the CPU's and the GPU's alignment rules, Intel interprets CL_MEM_USE_HOST_PTR as permission to share physical DRAM between CPU and GPU, so there is no actual data copy. Naturally, that is very fast!

Here is a link that explains it:

https://software.intel.com/en-us/articles/getting-the-most-from-opencl-12-how-to-increase-performance-by-minimizing-buffer-copies-on-intel-processor-graphics

P.S. I know my reply comes far too late for the OP, but other readers may find it useful.

Answered 2018-01-07T06:06:12.737