I'm asking here because I thought I understood how OpenCL works, but... I think there are a few things I don't get.


double[] arrA = new double[]{1,2,3};
double[] arrB = new double[]{6,7,8};

dx1 = 1 - 1; dx2 = 2 - 1; dx3 = 3 - 1; dx4 = 1 - 2; ... dxLast = 3 - 3
dy1 = 6 - 6; dy2 = 7 - 6; dy3 = 8 - 6; dy4 = 6 - 7; ... dyLast = 8 - 8

(The dx and dy of a point paired with itself will be 0, but I don't care about ignoring those cases for now)

Then compute hypot(dx(i), dy(i)) for every pair and, once all of those values are available, take the largest hypot value.
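
For reference, here is a minimal CPU-side sketch of what I expect the kernel to compute. It assumes arrA holds the x coordinates and arrB the y coordinates of the same points; the class and method names (MaxPairwiseHypot, maxHypot) are made up for this illustration and are not part of my real code.

```java
final class MaxPairwiseHypot {
    /**
     * Returns the largest hypot(dx, dy) over all ordered pairs (a, b),
     * where dx = xs[b] - xs[a] and dy = ys[b] - ys[a].
     * Pairs with a == b contribute 0 and are simply not skipped.
     */
    static double maxHypot(double[] xs, double[] ys) {
        if (xs.length != ys.length) {
            throw new IllegalArgumentException("xs and ys must have the same length");
        }
        double max = 0.0; // hypot is never negative, so 0 is a safe starting value
        for (int a = 0; a < xs.length; a++) {
            for (int b = 0; b < xs.length; b++) {
                double dx = xs[b] - xs[a];
                double dy = ys[b] - ys[a];
                max = Math.max(max, Math.hypot(dx, dy));
            }
        }
        return max;
    }

    public static void main(String[] args) {
        double[] arrA = {1, 2, 3};
        double[] arrB = {6, 7, 8};
        // The farthest pair is (1,6) and (3,8), so this is hypot(2, 2).
        System.out.println(maxHypot(arrA, arrB));
    }
}
```

A value computed this way on the host can be compared against whatever the kernel returns for small inputs.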


String programSource =
    "#ifdef cl_khr_fp64 \n"
+ "   #pragma OPENCL EXTENSION cl_khr_fp64 : enable \n"
+ "#elif defined(cl_amd_fp64) \n"
+ "   #pragma OPENCL EXTENSION cl_amd_fp64 : enable \n"
+ "#else \n"
+ "   #error Double precision floating point not supported by OpenCL implementation.\n"
+ "#endif \n"
+ "__kernel void "
+ "sampleKernel(__global const double *bufferX,"
+ "             __global const double *bufferY,"
+ "             __local double* scratch,"
+ "             __global double* result,"
+ "             __const int lengthX,"
+ "             __const int lengthY){"
+ "    const int index_a = get_global_id(0);"//Get the global indexes for 2D reference
+ "    const int index_b = get_global_id(1);"
+ "    const int local_index = get_local_id(0);"//Current thread id -> Should be the same as index_a * index_b + index_b;
+ "    if (local_index < (lengthX * lengthY)) {"// Load data into local memory
+ "       if(index_a < lengthX && index_b < lengthY)"
+ "       {"
+ "           double dx = (bufferX[index_b] - bufferX[index_a]);"
+ "           double dy = (bufferY[index_b] - bufferY[index_a]);"
+ "           scratch[local_index] = hypot(dx, dy);"
+ "       }"
+ "    } "
+ "    else {"
+ "       scratch[local_index] = 0;"// 0 is the identity element for max here, since hypot is never negative
+ "    }"
//Make a Barrier to make sure all values were set into the local array
+ "    barrier(CLK_LOCAL_MEM_FENCE);"
//If someone can explain the offset thing to me I'd really appreciate it...
//I just know there is always a division by 2
+ "    for(int offset = get_local_size(0) / 2; offset > 0; offset >>= 1) {"
+ "       if (local_index < offset) {"
+ "          double other = scratch[local_index + offset];"
+ "          double mine = scratch[local_index];"
+ "          scratch[local_index] = (mine > other) ? mine : other;"
+ "       }"
+ "       barrier(CLK_LOCAL_MEM_FENCE);"
//A barrier to make sure that all values were checked
+ "    }"
+ "    if (local_index == 0) {"
+ "       result[get_group_id(0)] = scratch[0];"
+ "    }"
+ "}";

For this case, the defined GWG size is (100, 100, 0) and the LWI size is (10, 10, 0).

So, for this example, both arrays have size 10, and the GWG and LWI sizes are obtained as follows:

//clGetKernelWorkGroupInfo(kernel, device, CL.CL_KERNEL_WORK_GROUP_SIZE, Sizeof.size_t, Pointer.to(buffer), null);
long kernel_work_group_size = OpenClUtil.getKernelWorkGroupSize(kernel, device.getCl_device_id(), 3);
//clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES, Sizeof.size_t * numValues, Pointer.to(buffer), null);
long[] maxSize = device.getMaximumSizes();

maxSize[0] = ( kernel_work_group_size > maxSize[0] ? maxSize[0] : kernel_work_group_size);
maxSize[1] = ( kernel_work_group_size > maxSize[1] ? maxSize[1] : kernel_work_group_size);
maxSize[2] = ( kernel_work_group_size > maxSize[2] ? maxSize[2] : kernel_work_group_size);
//    maxSize[2] = 

long xMaxSize = (x > maxSize[0] ? maxSize[0] : x);
long yMaxSize = (y > maxSize[1] ? maxSize[1] : y);
long zMaxSize = (z > maxSize[2] ? maxSize[2] : z);

long local_work_size[] = new long[] { xMaxSize, yMaxSize, zMaxSize };

int numWorkGroupsX = 0;
int numWorkGroupsY = 0;
int numWorkGroupsZ = 0;

if(local_work_size[0] != 0)
  numWorkGroupsX = (int) ((total + local_work_size[0] - 1) / local_work_size[0]);

if(local_work_size[1] != 0)
  numWorkGroupsY = (int) ((total + local_work_size[1] - 1) / local_work_size[1]);

if(local_work_size[2] != 0)
  numWorkGroupsZ = (int) ((total + local_work_size[2] - 1) / local_work_size[2]);

long global_work_size[] = new long[] { numWorkGroupsX * local_work_size[0],
    numWorkGroupsY * local_work_size[1], numWorkGroupsZ *  local_work_size[2]};
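
The rounding I'm doing per axis boils down to the small sketch below. It is only an illustration of the arithmetic: NdRangeSizes and roundUpToMultiple are hypothetical names, and it assumes the local size on the axis is greater than zero.

```java
final class NdRangeSizes {
    /**
     * Rounds workItems up to the next multiple of localSize, which is the
     * same as (number of work-groups on that axis) * localSize.
     */
    static long roundUpToMultiple(long workItems, long localSize) {
        if (localSize <= 0) {
            throw new IllegalArgumentException("localSize must be positive");
        }
        long numWorkGroups = (workItems + localSize - 1) / localSize; // ceiling division
        return numWorkGroups * localSize;
    }

    public static void main(String[] args) {
        // Assumed example values: 10 items per axis, work-groups of 10 per axis.
        long[] local = {10, 10};
        long[] global = {roundUpToMultiple(10, local[0]), roundUpToMultiple(10, local[1])};
        System.out.println(global[0] + " x " + global[1]);
    }
}
```

Any work items added by the rounding fall outside the real data, which is why the kernel needs the bounds checks against lengthX and lengthY.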

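The problem is that I'm not getting the expected values, so I decided to run some tests with a smaller kernel, changing the [VARIABLE TO TEST VALUES] object that is written to the result array: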

// The source code of the OpenCL program to execute
private static String programSourceA =
    "#ifdef cl_khr_fp64 \n"
+ "   #pragma OPENCL EXTENSION cl_khr_fp64 : enable \n"
+ "#elif defined(cl_amd_fp64) \n"
+ "   #pragma OPENCL EXTENSION cl_amd_fp64 : enable \n"
+ "#else \n"
+ "   #error Double precision floating point not supported by OpenCL implementation.\n"
+ "#endif \n"
+ "__kernel void "
+ "sampleKernel(__global const double *bufferX,"
+ "             __global const double *bufferY,"
+ "             __local double* scratch,"
+ "             __global double* result,"
+ "             __const int lengthX,"
+ "             __const int lengthY){"
//Get the global indexes for 2D reference
+ "    const int index_a = get_global_id(0);"
+ "    const int index_b = get_global_id(1);"
//Current thread id -> Should be the same as index_a * index_b + index_b;
+ "    const int local_index = get_local_id(0);"
// Load data into local memory
//Only print values if index_a < ArrayA length
//Only print values if index_b < ArrayB length
//Only print values if local_index < (lengthX * lengthY)
//Only print values if this is the first work group.
+ "    if (local_index < (lengthX * lengthY)) {"
+ "       if(index_a < lengthX && index_b < lengthY)"
+ "       {"
+ "           double dx = (bufferX[index_b] - bufferX[index_a]);"
+ "           double dy = (bufferY[index_b] - bufferY[index_a]);"
+ "           result[local_index] = hypot(dx, dy);"
+ "       }"
+ "    } "
+ "    else {"
// 0 is the identity element for max here, since hypot is never negative
+ "       result[local_index] = 0;"
+ "    }"

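The returned values are far from what I expected, but if [VARIABLE TO TEST VALUES] is (index_a * index_b) + index_a, then almost every element of the returned array holds the correct (index_a * index_b) + index_a value. I mean: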

result[0] -> 0
result[1] -> 1
result[2] -> 2
result[97] -> 97
result[98] -> 98
result[99] -> 99







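Thanks for your answer. First of all, this kernel code is based on the commutative reduction code explained here: http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-case-study-simple-reductions/. So I'm using that code, but I added a few things such as the 2D operations.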


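1.1- Actually the global work group size is (100, 100, 0)... 100 is the result of multiplying 10 x 10, where 10 is the current array size, so my global work group size is based on that rule... and the local work item size is (10, 10, 0). The global work group size must be a multiple of the local work item size; I have read this in many examples, and I think that part is fine.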

1.2- In my test code I'm using the same arrays; in fact, if I change the array size, the GWG size and the LWI size change dynamically.



3.1- Yes, I know, but I've realized that I'm not using the Y-axis ID, so there may be another problem here.
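
To make the Y-axis point concrete: with a 10x10 work-group, get_local_id(0) alone only runs from 0 to 9, so it cannot address 100 scratch slots. One common way to get a unique slot per work item is row-major flattening, i.e. get_local_id(1) * get_local_size(0) + get_local_id(0). The sketch below just simulates that arithmetic in Java; LocalIndexSketch and its methods are hypothetical names, not OpenCL or JOCL API.

```java
final class LocalIndexSketch {
    /** Row-major flattening of a 2D local id into a single scratch index. */
    static int flatten(int localIdX, int localIdY, int localSizeX) {
        return localIdY * localSizeX + localIdX;
    }

    /** Checks that every (x, y) in a sizeX x sizeY work-group maps to its own slot. */
    static boolean coversAllSlots(int sizeX, int sizeY) {
        boolean[] seen = new boolean[sizeX * sizeY];
        for (int y = 0; y < sizeY; y++) {
            for (int x = 0; x < sizeX; x++) {
                int slot = flatten(x, y, sizeX);
                if (seen[slot]) {
                    return false; // two work items would write the same slot
                }
                seen[slot] = true;
            }
        }
        return true; // sizeX * sizeY distinct slots were visited
    }

    public static void main(String[] args) {
        System.out.println(flatten(3, 2, 10));      // slot for local id (3, 2)
        System.out.println(coversAllSlots(10, 10)); // whether 10x10 fills 100 distinct slots
    }
}
```

By contrast, a formula like index_a * index_b + index_b gives the same value for several different (index_a, index_b) pairs, so it cannot serve as a unique index.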




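There are a lot of problems in your code, and some of them are conceptual. I think you should read the standard or an OpenCL guide in full before you start coding, because some of the calls you are using behave differently from what you expect.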

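  1. Work groups and work items do not work as they do in CUDA. If you want 100x100 work items split into 10x10 work groups, use (100x100) as the global size and (10x10) as the local size. This is unlike CUDA, where the global size is internally multiplied by the local size.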

    1.1. In your test code, if you use 10x10 and 10x10, you are not filling the whole space, and the unfilled area will still contain garbage such as -X.xxxxxE-308.

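  2. You should not pass lengthX and lengthY and put lots of ifs in your code. OpenCL lets you launch the kernel with an offset and a specific number of items, so you can control this from the host side. Besides, all those ifs carry a performance penalty and are never good practice, because they make the code less readable.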

  3. get_local_size(0) gives you the local size along axis 0 (10 in your case). What is it that you don't understand about this call? Why do you always divide it by 2?
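
On the offset question, the halving loop is a tree reduction: at each step the first `offset` work items each combine their own slot with the slot `offset` positions away, so the number of live values halves until the result sits in slot 0. Below is a sequential Java simulation of that loop for a single work-group. It is a sketch under my own assumptions, not OpenCL code, and the names (TreeReductionSketch, reduceMax, reduceMaxPadded) are made up. It also shows why a local size of 10 is a problem: the scheme only visits every slot when the size is a power of two.

```java
final class TreeReductionSketch {
    /** Mirrors the kernel loop: offset = n/2, n/4, ..., 1 over a copy of the data. */
    static double reduceMax(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        double[] scratch = values.clone();
        for (int offset = scratch.length / 2; offset > 0; offset >>= 1) {
            // On the device, these iterations run in parallel, one per work item,
            // followed by a barrier before the next offset.
            for (int i = 0; i < offset; i++) {
                scratch[i] = Math.max(scratch[i], scratch[i + offset]);
            }
        }
        return scratch[0];
    }

    /** Same reduction after padding to the next power of two with -infinity. */
    static double reduceMaxPadded(double[] values) {
        int size = 1;
        while (size < values.length) {
            size <<= 1;
        }
        double[] padded = new double[size];
        java.util.Arrays.fill(padded, Double.NEGATIVE_INFINITY); // identity for max
        System.arraycopy(values, 0, padded, 0, values.length);
        return reduceMax(padded);
    }

    public static void main(String[] args) {
        double[] ten = {0, 0, 0, 0, 9, 0, 0, 0, 0, 0};
        // With 10 slots the offsets are 5, 2, 1, and slot 4 is never read
        // after the first step, so its value is lost.
        System.out.println(reduceMax(ten));
        System.out.println(reduceMaxPadded(ten));
    }
}
```

So a local size such as 8 or 16 per reduced axis, or padding the scratch array as above, would be one way to make the halving loop see every value.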


