opencl - OpenCL (aparapi) simple reduction slow on Radeon

Question

I am trying to code a simple reduction (in this case a sum) over a large double array in OpenCL. I have looked at online tutorials and found that this is essentially the way to solve my problem:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

typedef struct This_s{
   __global double *nums;
   int nums__javaArrayLength;
   __local double *buffer;
   __global double *res;
   int passid;
}This;
int get_pass_id(This *this){
   return this->passid;
}
__kernel void run(
   __global double *nums, 
   int nums__javaArrayLength, 
   __local double *buffer, 
   __global double *res, 
   int passid
){
   This thisStruct;
   This* this=&thisStruct;
   this->nums = nums;
   this->nums__javaArrayLength = nums__javaArrayLength;
   this->buffer = buffer;
   this->res = res;
   this->passid = passid;
   {
      int tid = get_local_id(0);
      int i = (get_group_id(0) * get_local_size(0)) + get_local_id(0);
      int gridSize = get_local_size(0) * get_num_groups(0);
      int n = this->nums__javaArrayLength;
      double cur = 0.0;
      for (; i<n; i = i + gridSize){
         cur = cur + this->nums[i];
      }
      this->buffer[tid]  = cur;
      barrier(CLK_LOCAL_MEM_FENCE);
      barrier(CLK_LOCAL_MEM_FENCE);
      if (tid<32){
         this->buffer[tid]  = this->buffer[tid] + this->buffer[(tid + 32)];
      }
      barrier(CLK_LOCAL_MEM_FENCE);
      if (tid<16){
         this->buffer[tid]  = this->buffer[tid] + this->buffer[(tid + 16)];
      }
      barrier(CLK_LOCAL_MEM_FENCE);
      if (tid<8){
         this->buffer[tid]  = this->buffer[tid] + this->buffer[(tid + 8)];
      }
      barrier(CLK_LOCAL_MEM_FENCE);
      if (tid<4){
         this->buffer[tid]  = this->buffer[tid] + this->buffer[(tid + 4)];
      }
      barrier(CLK_LOCAL_MEM_FENCE);
      if (tid<2){
         this->buffer[tid]  = this->buffer[tid] + this->buffer[(tid + 2)];
      }
      barrier(CLK_LOCAL_MEM_FENCE);
      if (tid<1){
         this->buffer[tid]  = this->buffer[tid] + this->buffer[(tid + 1)];
      }
      barrier(CLK_LOCAL_MEM_FENCE);
      if (tid==0){
         this->res[get_group_id(0)]  = this->buffer[0];
      }
      return;
   }
}

If you are wondering about the strange this, that is an (unfortunately necessary) artifact of aparapi, which I use to translate Java to OpenCL.

My kernel produces the correct results and, on reasonably beefy Nvidia Hardware, it is about 10x faster than a sequential sum in Java. On a Radeon R9 280 however it is comparable in performance to the simple Java code.

I have profiled the kernel with CodeXL. It tells me that MemUnitBusy is at just 6%. Why is it so low?

score 0 · Accepted Answer

事实证明 OpenCL 没有（直接）有问题，但是 aparapis 缓冲区管理是。

我在没有aparapi的情况下尝试了完全相同的内核，并且性能很好。我一使用它就变坏了CL_MEM_USE_HOST_PTR，可悲的是，这是使用 aparapi 时的唯一选择。似乎 AMD 并没有使用该选项将主机内存复制到设备，即使在几次“预热”运行之后也是如此。

score -1 · Accepted Answer

您可能想要考虑转移到aparapi.com上更活跃的项目。它包括对错误的几个修复以及与您上面链接的旧库相比的许多额外功能和性能增强。它也在 Maven 中心，有大约十几个版本。所以更容易使用。新的Github 存储库在这里。

opencl - OpenCL (aparapi) simple reduction slow on Radeon

2 回答 2

Related

Reference