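I am trying to implement a convolution algorithm in OpenCL (using Vivado HLS). Before a work-group does its computation, I want to load the part of the image it needs into local memory (for example, with a 128*128 work-group and a 5*5 convolution filter, I would load 132*132 pixels). How can I write the kernel so that local memory is loaded only once, when the work-group starts?
Pseudocode: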
#define WKGRP_W 128
#define WKGRP_H 128
#define FILTER_SIZE 5
#define BUFFER_W (WKGRP_W + FILTER_SIZE - 1)
#define BUFFER_H (WKGRP_H + FILTER_SIZE - 1)
__kernel void __attribute__ ((reqd_work_group_size(WKGRP_W, WKGRP_H, 1)))
convolve(
const __global data_t* input,
__global data_t* output,
__constant data_t* filter_params
){
__local data_t img_buffer[BUFFER_H][BUFFER_W];
__local data_t output_buffer[WKGRP_H][WKGRP_W];
/**
* if (the workgroup is starting) {
* load data from input into img_buffer
* }
*/
filter(img_buffer, filter_params, get_local_id(0), get_local_id(1), output_buffer);
/**
* if (the workgroup is finished) {
* write data from output_buffer back to output
* }
*/
}
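
To make the intended load step concrete, here is a minimal sketch in plain C that only emulates the indexing on the host. It assumes one common pattern, not something confirmed for Vivado HLS: every work-item copies a strided share of the tile, so the whole work-group fills the 132*132 buffer exactly once. In a real kernel, the inner loop would be followed by barrier(CLK_LOCAL_MEM_FENCE) before filter() reads img_buffer. The names simulate_tile_load and load_count are hypothetical helpers introduced for this illustration.

```c
#include <assert.h>
#include <string.h>

#define WKGRP_W 128
#define WKGRP_H 128
#define FILTER_SIZE 5
/* Parenthesized so the macros expand safely inside larger expressions. */
#define BUFFER_W (WKGRP_W + FILTER_SIZE - 1)
#define BUFFER_H (WKGRP_H + FILTER_SIZE - 1)

/* Hypothetical bookkeeping: how many times each tile element gets loaded. */
static unsigned char load_count[BUFFER_H * BUFFER_W];

/*
 * Emulates the cooperative load on the host. The two outer loops stand in
 * for the work-items of one work-group; the inner loop is what each
 * work-item would execute in the kernel.
 * Returns 1 if every tile element is loaded exactly once, 0 otherwise.
 */
int simulate_tile_load(void)
{
    const int group_size = WKGRP_W * WKGRP_H;  /* work-items per group */
    const int tile_size = BUFFER_W * BUFFER_H; /* elements to load */

    memset(load_count, 0, sizeof load_count);

    for (int ly = 0; ly < WKGRP_H; ++ly) {
        for (int lx = 0; lx < WKGRP_W; ++lx) {
            /* Flattened local id, as built from get_local_id(0) and (1). */
            const int lid = ly * WKGRP_W + lx;

            /* Strided loop: this work-item handles lid, lid + group_size, ... */
            for (int i = lid; i < tile_size; i += group_size) {
                const int ty = i / BUFFER_W; /* row inside the tile */
                const int tx = i % BUFFER_W; /* column inside the tile */
                assert(ty < BUFFER_H && tx < BUFFER_W);
                /* In the kernel this would copy one pixel from input
                 * into img_buffer[ty][tx]. */
                load_count[ty * BUFFER_W + tx]++;
            }
        }
    }

    for (int i = 0; i < tile_size; ++i) {
        if (load_count[i] != 1) {
            return 0;
        }
    }
    return 1;
}
```

With these sizes the tile has more elements than the work-group has work-items, so some work-items copy two elements and the rest copy one. Whether Vivado HLS maps this pattern efficiently is an open question here; the sketch only checks the index arithmetic.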