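I have just started using OpenCL through Python's PyOpenCL interface. I am trying to write a very simple "recurrent" program in which the result of each loop iteration in every kernel depends on the output of another kernel from the previous loop cycle, but I am running into synchronization problems: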

__kernel void part1(__global float* a, __global float* c)
{
    unsigned int i = get_global_id(0);

    c[i] = 0;
    barrier(CLK_GLOBAL_MEM_FENCE);

    if (i < 9)
    {
        for(int t = 0; t < 2; t++){
            c[i] = c[i+1] + a[i];
            barrier(CLK_GLOBAL_MEM_FENCE);
        }
    }
}

The host application is

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# read in the OpenCL source file as a string
with open('recurrent.cl', 'r') as f:
    fstr = f.read()

# create the program
program = cl.Program(ctx, fstr).build()

mf = cl.mem_flags

# initialize client side (CPU) arrays
a = np.arange(10, dtype=np.float32)

# create OpenCL buffers
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
# the kernel both reads and writes c, so the buffer must be READ_WRITE
dest_buf = cl.Buffer(ctx, mf.READ_WRITE, a.nbytes)

# execute program
program.part1(queue, a.shape, None, a_buf, dest_buf)
c = np.empty_like(a)
cl.enqueue_copy(queue, c, dest_buf).wait()

print("a", a)
print("c", c)

The result is

a [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.]
c [  0.   1.   5.   3.   4.  18.  13.   7.   8.   0.]

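As you can see, some of the result values are correct, e.g. the third position = 5 = 3 + 2, but others are not: the second position, for example, is 1 = 0 + 1. So the sum was taken over other threads' results from different points in time, despite the barrier. I thought the barrier would make sure that all threads have reached it and have written their results to global memory?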
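
For comparison, here is a minimal pure-Python sketch of what the kernel would produce if every thread finished a loop step before any thread started the next one. That ordering is an assumption about the intended semantics, not a description of what an OpenCL runtime does, and the helper name reference_part1 is hypothetical.

```python
def reference_part1(a, steps=2):
    """Model the recurrence with a full synchronization between loop steps."""
    n = len(a)
    c = [0.0] * n                   # corresponds to c[i] = 0 in the kernel
    for _ in range(steps):
        prev = list(c)              # snapshot of the previous step's values
        for i in range(n - 1):      # mirrors the kernel's `if (i < 9)` when n == 10
            c[i] = prev[i + 1] + a[i]
    return c


a = [float(x) for x in range(10)]
print(reference_part1(a))
```

Under that assumption only some positions of the observed output (for example the third one, 5) agree with the reference.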

This is probably something very simple, and I would appreciate any hints and comments!

PS: I am running this on a Sandy Bridge CPU using the Intel SDK.

2 Answers

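I think I have the answer now. The OpenCL code was actually fine. However, a barrier only takes effect among threads that are in the same workgroup. That was not the case here, which is easy to check by reading out the local id with get_local_id(0) (as suggested by Huseyin). In my case the host created one workgroup per thread instead of putting all threads into a single workgroup. Performance-wise that makes sense, compare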

Questions about global and local work size
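
The relationship between the ids can be sketched in plain Python. This is a small model of a 1-D launch with zero offset under OpenCL 1.x rules (the global size must be a multiple of the local size); the function name work_item_ids is hypothetical and nothing here calls into PyOpenCL.

```python
def work_item_ids(global_size, local_size):
    """Return (global_id, group_id, local_id) for each work-item of a 1-D launch."""
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")
    return [(g, g // local_size, g % local_size) for g in range(global_size)]


# One work-item per workgroup: every get_local_id(0) would be 0,
# and a barrier has nobody else in the group to synchronize with.
per_thread = work_item_ids(10, 1)
print(sorted({local for _, _, local in per_thread}))

# A single workgroup of 10: local ids run over 0..9 and all share group 0.
one_group = work_item_ids(10, 10)
print(sorted({group for _, group, _ in one_group}))
```

If every work-item reports local id 0, each one sits in its own workgroup, which matches the check described above.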

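In our case, however, we need to make sure that the data is synchronized between the threads, so all of them should be in one workgroup. To achieve that we need to change the launch of part1,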

program.part1(queue, a.shape, None, a_buf, dest_buf)

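The second argument is the global_size of the job (i.e. the number of threads created), while the third argument is the local_size, i.e. the number of threads per workgroup. Passing None leaves the choice of local_size to the implementation. The line should therefore read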

program.part1(queue, a.shape, a.shape, a_buf, dest_buf)

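This creates a single workgroup containing all threads (but keep an eye on the maximum number of threads allowed in one workgroup!). The code still does not work at this point, though. The last problem concerns the barriers in the OpenCL code: the last thread, with id 9, fails the i < 9 test and never reaches the barriers inside the loop, so all the other threads wait for it to hit a barrier (although I wonder why this does not throw an exception?). So we simply need to reduce the total number of threads (getting rid of the last one),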

program.part1(queue, (a.shape[0]-1,), (a.shape[0]-1,), a_buf, dest_buf)

This works! I learned a few lessons along the way...
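
Why dropping the last thread matters can be sketched by counting the barriers each thread of the question's kernel would execute. The OpenCL specification requires every work-item of a workgroup to reach the same barrier calls and leaves the behavior undefined otherwise, which is why no exception is raised. The helper names below are hypothetical, and the model only counts calls; it does not run any OpenCL code.

```python
def barriers_hit(global_id, guard=9, steps=2):
    """Count the barrier() calls one thread of the question's kernel executes."""
    count = 1                  # the barrier right after c[i] = 0
    if global_id < guard:      # only these threads enter the loop
        count += steps         # one barrier per loop iteration
    return count


def barrier_counts_uniform(global_size):
    """True if every thread in a single workgroup hits the same number of barriers."""
    return len({barriers_hit(i) for i in range(global_size)}) == 1


print(barrier_counts_uniform(10))  # thread 9 skips the loop barriers
print(barrier_counts_uniform(9))   # all remaining threads take the same path
```

An alternative that keeps all 10 threads would be to move the barrier outside the if, so that every thread reaches it on every iteration.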

Thanks again to Huseyin! blue2script

answered 2013-06-16T09:47:46.890

Edit: user blue2script was right, the issue was that the barrier was not being hit by all local threads. On top of that, a barrier cannot synchronize between compute units / workgroups.

My answer neither adds anything nor solves any problem here, so disregard the if in the kernel functions below. It is wrong.


Incomplete

 __kernel void part1(__global float* a, __global float* c)
 {
      unsigned int i = get_global_id(0);

      c[i] = 0;
      barrier(CLK_GLOBAL_MEM_FENCE);

      if (i < 9)
      {
          for(int t = 0; t < 2; t++)
          {
              c[i] = c[i+1] + a[i];// c[i+1] belongs to the neighbouring thread,
                                   // and there is no guarantee which thread
                                   // (the i-th or the (i+1)-th) computes first,
                                   // so you need to take a copy of c[] first
              barrier(CLK_GLOBAL_MEM_FENCE);// that's why this line does not help
          }
      }
 }
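
The ordering problem described in the comments can be illustrated with a small pure-Python model. Serial execution in a fixed order is only one of many possible interleavings, so this is a sketch under that simplifying assumption, and the function names step_in_place and step_staged are hypothetical.

```python
def step_in_place(c, a, order):
    """One loop step where each thread reads c[i+1] and writes c[i] immediately."""
    c = list(c)
    for i in order:
        c[i] = c[i + 1] + a[i]
    return c


def step_staged(c, a, order):
    """One loop step where all reads finish before any write happens."""
    staged = {i: c[i + 1] + a[i] for i in order}   # read phase
    c = list(c)
    for i in order:                                # write phase
        c[i] = staged[i]
    return c


a = [float(x) for x in range(10)]
c0 = [0.0] * 10
ascending = list(range(9))
descending = ascending[::-1]

# In-place updates depend on which thread runs first.
print(step_in_place(c0, a, ascending) == step_in_place(c0, a, descending))
# Staged updates give the same result for either order.
print(step_staged(c0, a, ascending) == step_staged(c0, a, descending))
```

The staged variant corresponds to computing into a separate copy, placing a barrier, and only then writing back.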

Using global

 __kernel void part1(__global float* a, __global float* c,__global float* d)
 {
      unsigned int i = get_global_id(0);

      c[i] = 0;
      d[i]=c[i]; 
      barrier(CLK_GLOBAL_MEM_FENCE);

      if (i < 9)
      {
          for(int t = 0; t < 2; t++)
          {
              d[i] = c[i+1] + a[i];//it is guaranteed that no neighbour thread can
                                   //change this threads d[i] element before/after
                                   //execution
              barrier(CLK_GLOBAL_MEM_FENCE);
              c[i]=d[i];
              barrier(CLK_GLOBAL_MEM_FENCE);
          }
      }
      barrier(CLK_GLOBAL_MEM_FENCE);

 }

Using locals (assuming a workgroup size of 256 and a total work size that is a multiple of it):

 __kernel void part1(__global float* a, __global float* c)
 {
      unsigned int i = get_global_id(0);
      unsigned int Li=get_local_id(0);
      __local float d[256];
      c[i] = 0;
      barrier(CLK_GLOBAL_MEM_FENCE);
      d[Li]=c[i]; 
      barrier(CLK_LOCAL_MEM_FENCE);

      if (i < 9)
      {
          for(int t = 0; t < 2; t++)
          {
              d[Li] = c[i+1] + a[i];//it is guaranteed that no neighbour thread can
                                   //change this threads d[i] element before/after
                                   //execution

             barrier(CLK_LOCAL_MEM_FENCE);
             c[i]=d[Li]; //guaranteed they dont interfere each other
             barrier(CLK_LOCAL_MEM_FENCE);
          }
      }

 }

[Figure: Workgroup (image not available)]

Using private

 __kernel void part1(__global float* a, __global float* c)
 {
      unsigned int i = get_global_id(0);
      unsigned int Li=get_local_id(0);
      __private float f1;
      c[i] = 0;

      if (i < 9)
      {
          for(int t = 0; t < 2; t++)
          {
              f1 = c[i+1] + a[i];

             barrier(CLK_GLOBAL_MEM_FENCE);
             c[i]=f1; //guaranteed they dont interfere each other
             barrier(CLK_GLOBAL_MEM_FENCE);
          }
      }

 }
answered 2013-06-15T20:41:08.457