max - OpenCL (JOCL) - 2D calculus over two arrays in kernel

Question

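I'm asking this here because I thought I understood how OpenCL works, but... I think there are several things I'm not getting.

What I want to do is get the difference between all the values of two arrays, then calculate the hypot, and finally get the maximum hypot value. So if I have: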

double[] arrA = new double[]{1,2,3}
double[] arrB = new double[]{6,7,8}

Calculate
dx1 = 1 - 1; dx2 = 2 - 1; dx3 = 3 - 1, dx4= 1 - 2;... dxLast = 3 - 3
dy1 = 6 - 6; dy2 = 7 - 6; dy3 = 8 - 6, dy4= 6 - 7;... dyLast = 8 - 8

(The extreme dx and dy cases will be 0, but I don't care about ignoring those cases for now.)

Then calculate each hypot from hypot(dx(i), dy(i)) and, once all these values are obtained, take the maximum hypot value.

So, I have the following kernel defined:
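
For comparison with whatever the kernel returns, here is a minimal host-side sketch of the computation described above. It assumes dx is taken between elements of the first array and dy between elements of the second, as in the kernel below; the class and method names are made up for illustration.

```java
final class MaxHypotReference {
    /** Largest hypot(x[j] - x[i], y[j] - y[i]) over all index pairs (i, j). */
    static double maxPairwiseHypot(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        double max = 0.0; // pairs with i == j contribute 0, so 0 is a safe start
        for (int i = 0; i < x.length; i++) {
            for (int j = 0; j < x.length; j++) {
                double dx = x[j] - x[i];
                double dy = y[j] - y[i];
                max = Math.max(max, Math.hypot(dx, dy));
            }
        }
        return max;
    }

    public static void main(String[] args) {
        double[] arrA = {1, 2, 3};
        double[] arrB = {6, 7, 8};
        // The widest pair is (1,6) vs (3,8): dx = 2, dy = 2, so roughly 2.828.
        System.out.println(maxPairwiseHypot(arrA, arrB));
    }
}
```

This is O(n^2) and only meant as a reference to check the device result against on small inputs.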

String programSource =
    "#ifdef cl_khr_fp64 \n"
+ "   #pragma OPENCL EXTENSION cl_khr_fp64 : enable \n"
+ "#elif defined(cl_amd_fp64) \n"
+ "   #pragma OPENCL EXTENSION cl_amd_fp64 : enable \n"
+ "#else "
+ "   #error Double precision floating point not supported by OpenCL implementation.\n"
+ "#endif \n"
+ "__kernel void "
+ "sampleKernel(__global const double *bufferX,"
+ "             __global const double *bufferY,"
+ "             __local double* scratch,"
+ "             __global double* result,"
+ "             __const int lengthX,"
+ "             __const int lengthY){"
+ "    const int index_a = get_global_id(0);"//Get the global indexes for 2D reference
+ "    const int index_b = get_global_id(1);"
+ "    const int local_index = get_local_id(0);"//Current thread id -> Should be the same as index_a * index_b + index_b;
+ "    if (local_index < (lengthX * lengthY)) {"// Load data into local memory
+ "       if(index_a < lengthX && index_b < lengthY)"
+ "       {"
+ "           double dx = (bufferX[index_b] - bufferX[index_a]);"
+ "           double dy = (bufferY[index_b] - bufferY[index_a]);"
+ "           scratch[local_index] = hypot(dx, dy);"
+ "       }"
+ "    } "
+ "    else {"
+ "       scratch[local_index] = 0;"// Infinity is the identity element for the min operation
+ "    }"
//Make a Barrier to make sure all values were set into the local array
+ "    barrier(CLK_LOCAL_MEM_FENCE);"
//If someone can explain the offset thing to me, I'd really appreciate it...
//I just know there is always a division by 2
+ "    for(int offset = get_local_size(0) / 2; offset > 0; offset >>= 1) {"
+ "       if (local_index < offset) {"
+ "          float other = scratch[local_index + offset];"
+ "          float mine = scratch[local_index];"
+ "          scratch[local_index] = (mine > other) ? mine : other;"
+ "       }"
+ "       barrier(CLK_LOCAL_MEM_FENCE);"
//A barrier to make sure that all values were checked
+ "    }"
+ "    if (local_index == 0) {"
+ "       result[get_group_id(0)] = scratch[0];"
+ "    }"
+ "}";

For this case, the defined GWG size is (100, 100, 0) and the LWI size is (10, 10, 0).

So, for this example, both arrays have size 10, and the GWG and LWI sizes are obtained as follows:

//clGetKernelWorkGroupInfo(kernel, device, CL.CL_KERNEL_WORK_GROUP_SIZE, Sizeof.size_t, Pointer.to(buffer), null);
long kernel_work_group_size = OpenClUtil.getKernelWorkGroupSize(kernel, device.getCl_device_id(), 3);
//clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES, Sizeof.size_t * numValues, Pointer.to(buffer), null);
long[] maxSize = device.getMaximumSizes();

maxSize[0] = ( kernel_work_group_size > maxSize[0] ? maxSize[0] : kernel_work_group_size);
maxSize[1] = ( kernel_work_group_size > maxSize[1] ? maxSize[1] : kernel_work_group_size);
maxSize[2] = ( kernel_work_group_size > maxSize[2] ? maxSize[2] : kernel_work_group_size);
//    maxSize[2] = 

long xMaxSize = (x > maxSize[0] ? maxSize[0] : x);
long yMaxSize = (y > maxSize[1] ? maxSize[1] : y);
long zMaxSize = (z > maxSize[2] ? maxSize[2] : z);

long local_work_size[] = new long[] { xMaxSize, yMaxSize, zMaxSize };

int numWorkGroupsX = 0;
int numWorkGroupsY = 0;
int numWorkGroupsZ = 0;

if(local_work_size[0] != 0)
  numWorkGroupsX = (int) ((total + local_work_size[0] - 1) / local_work_size[0]);

if(local_work_size[1] != 0)
  numWorkGroupsY = (int) ((total + local_work_size[1] - 1) / local_work_size[1]);

if(local_work_size[2] != 0)
  numWorkGroupsZ = (int) ((total + local_work_size[2] - 1) / local_work_size[2]);

long global_work_size[] = new long[] { numWorkGroupsX * local_work_size[0],
    numWorkGroupsY * local_work_size[1], numWorkGroupsZ *  local_work_size[2]};
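
The round-up step in the code above can be isolated into a small helper. This is only a sketch under the assumption that each dimension is rounded from its own extent (here the array length), not from the total element count; `WorkSizes` and `roundUp` are hypothetical names, not JOCL API.

```java
final class WorkSizes {
    /** Smallest multiple of localSize that is greater than or equal to globalSize. */
    static long roundUp(long globalSize, long localSize) {
        if (localSize <= 0) {
            throw new IllegalArgumentException("localSize must be positive");
        }
        return ((globalSize + localSize - 1) / localSize) * localSize;
    }

    public static void main(String[] args) {
        long[] local = {10, 10};   // assumed 2D work-group shape
        long[] wanted = {10, 10};  // one work-item per (index_a, index_b) pair for arrays of length 10
        long[] global = {
            roundUp(wanted[0], local[0]),
            roundUp(wanted[1], local[1])
        };
        System.out.println(global[0] + " x " + global[1]);
    }
}
```

With a two-dimensional launch, only two entries per array are needed, so there is no third dimension to fill in.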

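The problem is that I'm not getting the expected values, so I decided to run some tests based on a smaller kernel, changing the [VARIABLE TO TEST VALUES] object returned in the result array: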

/**
* The source code of the OpenCL program to execute
*/
private static String programSourceA =
    "#ifdef cl_khr_fp64 \n"
+ "   #pragma OPENCL EXTENSION cl_khr_fp64 : enable \n"
+ "#elif defined(cl_amd_fp64) \n"
+ "   #pragma OPENCL EXTENSION cl_amd_fp64 : enable \n"
+ "#else "
+ "   #error Double precision floating point not supported by OpenCL implementation.\n"
+ "#endif \n"
+ "__kernel void "
+ "sampleKernel(__global const double *bufferX,"
+ "             __global const double *bufferY,"
+ "             __local double* scratch,"
+ "             __global double* result,"
+ "             __const int lengthX,"
+ "             __const int lengthY){"
//Get the global indexes for 2D reference
+ "    const int index_a = get_global_id(0);"
+ "    const int index_b = get_global_id(1);"
//Current thread id -> Should be the same as index_a * index_b + index_b;
+ "    const int local_index = get_local_id(0);"
// Load data into local memory
//Only print values if index_a < ArrayA length
//Only print values if index_b < ArrayB length
//Only print values if local_index < (lengthX * lengthY)
//Only print values if this is the first work group.
+ "    if (local_index < (lengthX * lengthY)) {"
+ "       if(index_a < lengthX && index_b < lengthY)"
+ "       {"
+ "           double dx = (bufferX[index_b] - bufferX[index_a]);"
+ "           double dy = (bufferY[index_b] - bufferY[index_a]);"
+ "           result[local_index] = hypot(dx, dy);"
+ "       }"
+ "    } "
+ "    else {"
// Infinity is the identity element for the min operation
+ "       result[local_index] = 0;"
+ "    }"

The returned values are far from what I expected, but if [VARIABLE TO TEST VALUES] is (index_a * index_b) + index_a, almost every value of the returned array has the correct (index_a * index_b) + index_a value. I mean:

result[0] -> 0
result[1] -> 1
result[2] -> 2
....
result[97] -> 97
result[98] -> 98
result[99] -> 99

But some values are: -3.350700319577517E-308....

What am I doing wrong???

I hope this is explained well enough and doesn't make you mad at me...

Thank you so much!!!!!

TomRacer

score 0 · Accepted Answer

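Thanks for your answer. First of all, this kernel code is based on the commutative reduction code explained here: http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-case-study-simple-reductions/ . So I'm using that code, but I added a few things such as the 2D operations.

About the points you mentioned before:

1.1- Actually the global work group size is (100, 100, 0)... The 100 is the result of multiplying 10 x 10, where 10 is the current array size, so my global work group size is based on this rule... The local work item size is then (10, 10, 0). The global work group size must be a multiple of the local work item size; I have read this in many examples, so I think this is OK.

1.2- In my test code I'm using the same arrays; in fact, if I change the array size, the GWG size and LWI size change dynamically.

2.1- There aren't that many "if"s there, just 3. The first one checks whether I must compute hypot() from the array objects or fill that object with zero. The second and third "if"s are just part of the reduction algorithm, and they seem fine.

2.2- About lengthX and lengthY, yes, you're right, but I don't have that yet. How should I use it?

3.1- Yes, I know, but I realized that I'm not using the Y-axis ID, so there may be another problem here.

3.2- The reduction algorithm iterates over each pair of elements stored in the scratch variable and takes the maximum of the two, so on each pass of the "for" loop it halves the number of elements left to compare.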
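
To make the halving concrete, here is a sequential Java sketch that mimics the kernel's offset loop on a plain array. It is an illustration, not device code: the inner loop stands in for the work-items that run in parallel, and the padding helper is an assumption added here to show what happens when the size is not a power of two (as with a local size of 10).

```java
import java.util.Arrays;

final class ReductionSketch {
    /** Mimics the kernel's loop: each pass folds the upper half onto the lower half. */
    static double treeMax(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("need at least one value");
        }
        double[] scratch = values.clone();
        for (int offset = scratch.length / 2; offset > 0; offset >>= 1) {
            for (int i = 0; i < offset; i++) { // on the device these iterations run in parallel
                scratch[i] = Math.max(scratch[i], scratch[i + offset]);
            }
            // the kernel places barrier(CLK_LOCAL_MEM_FENCE) here
        }
        return scratch[0];
    }

    /** Pads with the identity for max (negative infinity) up to the next power of two. */
    static double[] padToPowerOfTwo(double[] values) {
        int n = 1;
        while (n < values.length) {
            n <<= 1;
        }
        double[] padded = new double[n];
        Arrays.fill(padded, Double.NEGATIVE_INFINITY);
        System.arraycopy(values, 0, padded, 0, values.length);
        return padded;
    }

    public static void main(String[] args) {
        double[] ten = {1, 2, 3, 4, 99, 5, 6, 7, 8, 9};
        // With 10 elements the offsets are 5, 2, 1, and index 4 is never read after the first pass.
        System.out.println(treeMax(ten));                  // prints 8.0, the 99 is lost
        System.out.println(treeMax(padToPowerOfTwo(ten))); // prints 99.0
    }
}
```

The offsets only cover every element when the size is a power of two, which is why this style of reduction is usually run with power-of-two work-group sizes or with padded scratch space.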

Also, I'll post some changes to the main kernel code and the test kernel code, because there were some mistakes in them.

Regards...!!!

score 0 · Accepted Answer

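There are many problems in your code, and some of them are conceptual. I think you should read the standard or an OpenCL guide in full before you start coding, because some of the calls you are using behave differently from what you expect.

Work-groups and work-items are not like CUDA. If you want 100x100 work-items split into 10x10 work-groups, use (100x100) as the global size and (10x10) as the local size. This is unlike CUDA, where the global size is internally multiplied by the local size.

1.1. In your test code, if you are using 10x10 with 10x10, then you are not filling the whole space, and the unfilled area will still contain garbage such as -X.xxxxxE-308.
You should not pass lengthX and lengthY and put a lot of ifs in your code. OpenCL has a way to call the kernel with an offset and a specific number of items, so you can control this from the host side. By the way, those ifs cost performance and are never good practice, since they make the code less readable.
get_local_size(0) gives you the local size of axis 0 (10 in your case). What is it that you don't understand about this call? Why do you always divide it by 2?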
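
As a rough illustration of how the global, group, and local indices relate in a 2D launch, here is a small Java sketch of the arithmetic. It assumes a zero global offset, and the row-major flattening is just one possible choice; the class and method names are hypothetical.

```java
final class WorkItemIndexSketch {
    /** Work-group index along one axis, as get_group_id would report it. */
    static int groupId(int globalId, int localSize) {
        return globalId / localSize;
    }

    /** Index inside the work-group along one axis, as get_local_id would report it. */
    static int localId(int globalId, int localSize) {
        return globalId % localSize;
    }

    /** Row-major slot of a work-item inside its 2D work-group. */
    static int flatLocalIndex(int localX, int localY, int localSizeX) {
        return localY * localSizeX + localX;
    }

    public static void main(String[] args) {
        int localSizeX = 10, localSizeY = 10; // local size (10, 10)
        int gx = 37, gy = 52;                 // one work-item in a (100, 100) global range
        int lx = localId(gx, localSizeX);
        int ly = localId(gy, localSizeY);
        System.out.println("group = (" + groupId(gx, localSizeX) + ", " + groupId(gy, localSizeY) + ")");
        System.out.println("local = (" + lx + ", " + ly + ")");
        System.out.println("flat  = " + flatLocalIndex(lx, ly, localSizeX));
    }
}
```

In the kernel from the question, local_index comes from get_local_id(0) alone, so in a 10x10 work-group it only takes the values 0 to 9 and ten work-items end up sharing each scratch slot. A flattened index like the one above gives every work-item in the group its own slot.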

I hope this helps you with your debugging. Cheers
