I have a kernel and want to cache some global data in a local buffer to speed it up. I already got good results and a lower execution time. To decrease it further, I decided to modify the kernel to read these values in one contiguous stream.
Old kernel:
__local short3 (*lixArr) = (__local short3(*)) (lixInLocal + get_local_id(1) * 3);
__global short (*lab)[TS][TS][3] = (__global short(*)[TS][TS][3]) (buffer + TS*TS*12 + TS*TS*26*get_global_id(2));
__global short3 (*lix) = (__global short3(*)) (lab[0][get_global_id(0)+2][1+get_global_id(1)]);
lixArr[1] = lix[0].xyz;
lixArr[2] = lix[-TS].xyz;
lixArr[3] = lix[TS].xyz;
if(get_local_id(1) == 0)
    lixArr[0] = lix[-1].xyz;
else if(get_local_size(1) == get_local_id(1)+1)
    lixArr[3+1] = lix[1].xyz;
lix = (__global short3(*)) (lab[1][get_global_id(0)+2][1+get_global_id(1)]);
lixArr += (get_local_size(1))*3 + 2;
lixArr[1] = lix[0].xyz;
lixArr[2] = lix[-TS].xyz;
lixArr[3] = lix[TS].xyz;
if(get_local_id(1) == 0)
    lixArr[0] = lix[-1].xyz;
else if(get_local_size(1) == get_local_id(1)+1)
    lixArr[3+1] = lix[1].xyz;
barrier(CLK_LOCAL_MEM_FENCE);
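To make the old kernel's access pattern explicit, here is a small Python sketch of which pixels each work-item touches for the three unconditional loads. `LOCAL_SIZE` is a hypothetical work-group size (not taken from the code above), and `lix[-TS]`/`lix[TS]` are interpreted as "one row above/below", as the kernel intends:

```python
# Hypothetical reconstruction of the OLD kernel's three unconditional
# reads, in pixel (short3) coordinates. LOCAL_SIZE is an illustrative
# value for get_local_size(1), not taken from the original code.
LOCAL_SIZE = 8

pattern = {}
for name, row_offset in (("lix[0]", 0), ("lix[-TS]", -1), ("lix[TS]", +1)):
    # work-item j reads column 1 + j of the row at this offset
    pattern[name] = (row_offset, [1 + j for j in range(LOCAL_SIZE)])

for name, (row_offset, cols) in pattern.items():
    print(f"{name}: row offset {row_offset:+d}, columns {cols}")
```

Within each of these load instructions, adjacent work-items read adjacent pixels of the same row, which is exactly the layout the hardware can combine into wide transactions.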
Modified kernel with streamed reads:
__global short (*lab)[TS][TS][3] = (__global short(*)[TS][TS][3]) (buffer + TS*TS*12 + TS*TS*26*get_global_id(2));
__global short3 (*lix) = (__global short3(*)) (lab[(get_local_id(1) % 6 > 2)][get_global_id(0)+2][1+(get_local_id(1)/3)*6]);
__global short16 (*lix16);
__local short16 (*lixLocal16);
lix16 = (__global short16(*)) (&lix[(get_local_id(1) % 3 == 0) ? 0 : (get_local_id(1) % 3 == 1) ? -TS : TS]);
lixLocal16 = (__local short16(*)) (lixInLocal + ((get_local_id(1) % 6) * (get_local_size(1)+2) + 1 + (get_local_id(1) /6)*6));
lixLocal16[0] = lix16[0].s0123456789ABCDEF;
lixLocal16[1].s01 = lix16[1].s01;
if((get_local_id(1) % 3 + get_local_id(1) / 6) == 0)
    lixInLocal[(get_local_id(1) % 6) * (get_local_size(1)+2)] = lix[-1].xyz;
else if(get_local_id(1) % 3 == 0 && get_local_id(1)+6 >= get_local_size(1))
    lixInLocal[(get_local_id(1) % 6) * (get_local_size(1)+2)-1] = lix[1].xyz;
barrier(CLK_LOCAL_MEM_FENCE); // a full barrier is needed here: mem_fence only orders this work-item's accesses and does not synchronize the work-items that consume this local data
In the first kernel, as you can see, I read values that are far apart in memory. In the second case I modified the kernel so that each work-item reads 18 consecutive values instead of 6*3 scattered ones.
One thing that may be important: in the 1st case all the values a work-item reads are later used by that same work-item. In the 2nd case only one is used directly; the others are fetched on behalf of other work-items.
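The difference between the two patterns can be illustrated with a toy coalescing model. The numbers below (64-byte transactions, 32 work-items in lock-step, the exact byte strides) are assumptions for illustration, not measurements of the real kernels, but the shape of the pattern matches: in the old kernel consecutive work-items sit 6 bytes apart per load instruction, in the new one roughly 36 bytes apart.

```python
# Toy coalescing model: count distinct memory transactions needed by
# one load instruction issued by a group of work-items. LINE and WI
# are assumed values, not hardware-verified figures.
LINE = 64   # assumed transaction / cache-line size in bytes
WI = 32     # work-items executing the load together

def transactions(requests):
    """Distinct cache lines touched by a batch of (address, size) loads."""
    lines = set()
    for addr, size in requests:
        lines.update(range(addr // LINE, (addr + size - 1) // LINE + 1))
    return len(lines)

# Old pattern: each work-item loads one packed short3 (6 bytes), and
# adjacent work-items read adjacent pixels.
old = transactions([(j * 6, 6) for j in range(WI)])

# New pattern: each work-item loads a short16 (32 bytes) at the start
# of its own 18-short (36-byte) run, so consecutive work-items start
# 36 bytes apart instead of 6.
new = transactions([(j * 36, 32) for j in range(WI)])

print(old, new)
```

In this model the streamed version needs several times more transactions per load instruction, because the wide per-work-item reads spread the group across far more cache lines.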
- In the 1st case the kernel execution took 28.5 ms on average.
- In the 2nd case it takes 96 ms!
I also tried using vstoren and vloadn, but the result was even worse.
Final question: why do these consecutive reads increase the execution time so dramatically, when they should speed things up?