如何对 2D 数组的许多重叠但偏移的块进行操作,以便在 OpenCL 中更有效地执行?
例如,我有以下 OpenCL 内核:
__kernel void test_kernel(
read_only image2d_t src,
write_only image2d_t dest,
const int width,
const int height
)
{
const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
int2 pos = (int2)(get_global_id(0), get_global_id(1));
int2 pos0 = (int2)(pos.x - pos.x % 16, pos.y - pos.y % 16);
uint4 diff = (uint4)(0, 0, 0, 0);
for (int i=0; i<16; i++)
{
for (int j=0; j<16; j++)
{
diff += read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)) -
read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j));
}
}
write_imageui(dest, pos, diff);
}
它产生正确的结果,但速度很慢......在 NVS4200M 上只有 ~25 GFLOPS,输入为 1k x 1k。(硬件规格为 155 GFLOPS)。我猜这与内存访问模式有关。每个工作项读取一个 16x16 数据块,该数据块与 16x16 区域中的所有邻居相同,并且另一个偏移数据块大部分时间与其直接邻居的数据重叠。所有读取均通过采样器。主机程序是 PyOpenCL(我认为这实际上并没有改变任何东西),工作组大小是 16x16。
编辑:根据以下建议的新版本内核,将工作区复制到局部变量:
__kernel __attribute__((reqd_work_group_size(16, 16, 1)))
void test_kernel(
read_only image2d_t src,
write_only image2d_t dest,
const int width,
const int height
)
{
const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
int2 pos = (int2)(get_global_id(0), get_global_id(1));
int dx = pos.x % 16;
int dy = pos.y % 16;
__local uint4 local_src[16*16];
__local uint4 local_src2[32*32];
local_src[(pos.y % 16) * 16 + (pos.x % 16)] = read_imageui(src, sampler, pos);
local_src2[(pos.y % 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, pos);
local_src2[(pos.y % 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y));
local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, (int2)(pos.x, pos.y + 16));
local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y + 16));
barrier(CLK_LOCAL_MEM_FENCE);
uint4 diff = (uint4)(0, 0, 0, 0);
for (int i=0; i<16; i++)
{
for (int j=0; j<16; j++)
{
diff += local_src[ j*16 + i ] - local_src2[ (j+dy)*32 + i+dx ];
}
}
write_imageui(dest, pos, diff);
}
结果:输出正确,运行时间慢 56%。如果只使用 local_src(而不是 local_src2),结果会快 10%。
编辑:以更强大的硬件为基准,AMD Radeon HD 7850 获得 420GFLOPS,规格为 1751GFLOPS。公平地说,规范是针对乘加的,这里没有乘法,所以预期是~875GFLOPS,但这与理论性能相比仍然相差很多。
编辑:为了方便任何想尝试这个的人的运行测试,下面 PyOpenCL 中的主机端程序:
import pyopencl as cl
import numpy
import numpy.random
from time import time
CL_SOURCE = '''
// kernel goes here
'''
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
prg = cl.Program(ctx, CL_SOURCE).build()
h, w = 1024, 1024
src = numpy.zeros((h, w, 4), dtype=numpy.uint8)
src[:,:,:] = numpy.random.rand(h, w, 4) * 255
mf = cl.mem_flags
src_buf = cl.image_from_array(ctx, src, 4)
fmt = cl.ImageFormat(cl.channel_order.RGBA, cl.channel_type.UNSIGNED_INT8)
dest_buf = cl.Image(ctx, mf.WRITE_ONLY, fmt, shape=(w, h))
# warmup
for n in range(10):
event = prg.test_kernel(queue, (w, h), (16,16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
event.wait()
# benchmark
t1 = time()
for n in range(100):
event = prg.test_kernel(queue, (w, h), (16,16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
event.wait()
t2 = time()
print "Duration (host): ", (t2-t1)/100
print "Duration (event): ", (event.profile.end-event.profile.start)*1e-9
编辑:考虑内存访问模式,原来的幼稚版本可能还不错;当调用read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j))
工作组中的所有工作项时正在读取相同的位置(所以这只是一次读取??),并且在调用时read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j))
它们正在读取顺序位置(因此可以完美地合并读取??)。