
I have a kernel and want to cache some global data in a local buffer to speed it up. I already got good results and a lower execution time. To decrease it further, I decided to modify the kernel to read these values in one contiguous stream.
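For context, the general pattern is the usual local-memory staging one: each work-item copies part of the global data into a __local buffer, the work-group synchronizes, and then everyone reads its neighbours from local memory instead of global. A minimal sketch of that pattern (the kernel name, array names and the tile size of 64 are made up for illustration, not taken from my real kernel):

// Each work-item stages one short from global to local memory; after the
// barrier the whole work-group can read its neighbours from the fast tile.
__kernel void stage_to_local(__global const short *src, __global short *dst)
{
    __local short tile[64];                    // assumes local work size <= 64
    const size_t lid = get_local_id(0);
    const size_t gid = get_global_id(0);
    const size_t lsz = get_local_size(0);

    tile[lid] = src[gid];                      // coalesced: consecutive ids read consecutive shorts
    barrier(CLK_LOCAL_MEM_FENCE);              // make the staged tile visible to the whole group

    dst[gid] = (short)(tile[lid] + tile[(lid + 1) % lsz]);   // neighbour access now hits local memory
}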

Old kernel:

__local short3 (*lixArr) = (__local short3(*)) (lixInLocal + get_local_id(1) * 3);
__global short (*lab)[TS][TS][3] = (__global short(*)[TS][TS][3]) (buffer + TS*TS*12 + TS*TS*26*get_global_id(2));
__global short3 (*lix) = (__global short3(*)) (lab[0][get_global_id(0)+2][1+get_global_id(1)]);

lixArr[1] = lix[0].xyz;
lixArr[2] = lix[-TS].xyz;
lixArr[3] = lix[TS].xyz;

if(get_local_id(1) == 0)
    lixArr[0] = lix[-1].xyz;
else if(get_local_size(1) == get_local_id(1)+1)
    lixArr[3+1] = lix[1].xyz;

lix = (__global short3(*)) (lab[1][get_global_id(0)+2][1+get_global_id(1)]);
lixArr += (get_local_size(1))*3 + 2;

lixArr[1] = lix[0].xyz;
lixArr[2] = lix[-TS].xyz;
lixArr[3] = lix[TS].xyz;

if(get_local_id(1) == 0)
    lixArr[0] = lix[-1].xyz;
else if(get_local_size(1) == get_local_id(1)+1)
    lixArr[3+1] = lix[1].xyz;

barrier(CLK_LOCAL_MEM_FENCE);

Modified stream reads:

__global short (*lab)[TS][TS][3] = (__global short(*)[TS][TS][3]) (buffer + TS*TS*12 + TS*TS*26*get_global_id(2));
__global short3 (*lix) = (__global short3(*)) (lab[(get_local_id(1) % 6 > 2)][get_global_id(0)+2][1+(get_local_id(1)/3)*6]);

__global short16 (*lix16);
__local short16 (*lixLocal16);


lix16 = (__global short16(*)) (&lix[(get_local_id(1) % 3 == 0) ? 0 : (get_local_id(1) % 3 == 1) ? -TS : TS]);
lixLocal16 = (__local short16(*)) (lixInLocal + ((get_local_id(1) % 6) * (get_local_size(1)+2) + 1 + (get_local_id(1) /6)*6));

lixLocal16[0] = lix16[0].s0123456789ABCDEF;
lixLocal16[1].s01 = lix16[1].s01;

if((get_local_id(1) % 3 + get_local_id(1) / 6) == 0)
    lixInLocal[(get_local_id(1) % 6) * (get_local_size(1)+2)] = lix[-1].xyz;
else if(get_local_id(1) % 3 == 0 && get_local_id(1)+6 >= get_local_size(1))
    lixInLocal[(get_local_id(1) % 6) * (get_local_size(1)+2)-1] = lix[1].xyz;   


mem_fence(CLK_LOCAL_MEM_FENCE);

In the first kernel, as you can see, I read distant values. In the second case I modified the kernel to read 18 consecutive values instead of 6*3 separate ones.

One thing that may be important: in the 1st case, all the values a work-item reads are later used by that same work-item. In the 2nd case only one of them is used; the rest are staged for other work-items.

  • In the 1st case the kernel execution took 28.5 ms on average.
  • In the 2nd case it takes 96 ms!

I also tried using vstoren and vloadn, but the result was even worse.
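For reference, the vloadn/vstoren variant looked roughly like this (a sketch only: lixGlobalRaw and lixLocalRaw stand for plain __global short* / __local short* views of the same buffers, and rowOffset/localOffset are placeholder offsets, not my real indexing):

// Read 18 consecutive shorts (6 short3 pixels) as one 16-wide plus one 2-wide
// vector load, then store them into local memory with the matching vstore calls.
short16 head = vload16(0, lixGlobalRaw + rowOffset);
short2  tail = vload2 (0, lixGlobalRaw + rowOffset + 16);
vstore16(head, 0, lixLocalRaw + localOffset);
vstore2 (tail, 0, lixLocalRaw + localOffset + 16);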

Final question: why do these consecutive reads increase the execution time so drastically, when they should speed things up?

