Question: In CUDA, is there a general way of improving the performance of nested for-loops whose bounds are only determined at runtime (and therefore can't be unrolled by the compiler)?
Background: I am working on a CUDA implementation of a 2D image filter. For each pixel of the input, the corresponding output value is calculated from the (2*r+1) * (2*r+1) neighbouring pixels. Although the radius r is constant for a given image, the shape of the filter depends on the value at each pixel, so the filter can't be expressed as a true convolution or decomposed into two 1D passes.
I have a fairly efficient implementation for the case where the filter radius r is known at compile time, based on a scatter approach (which is faster than any gather approach I could come up with) in which each input pixel is assigned its own thread. The output is divided into tiles that are kept in shared memory. At the heart of the algorithm is a nested for-loop executed by each thread:
for (int i = -r; i < r + 1; i++) {
    for (int j = -r; j < r + 1; j++) {
        // Calculate and scatter value to output[offsetJ + j][offsetI + i]
    }
}
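For context, a stripped-down skeleton of the compile-time version looks roughly like this (the tile bookkeeping and the actual filter calculation are omitted, and the tile dimensions are just illustrative):

#define R      4    // filter radius, fixed at compile time in this version
#define TILE_W 16   // illustrative tile dimensions
#define TILE_H 16

__global__ void filterKernel(const float *input, float *output, int width, int height)
{
    // Output tile kept in shared memory, padded by the filter radius on each side
    __shared__ float tile[TILE_H + 2 * R][TILE_W + 2 * R];

    // ... load/initialise the tile and compute offsetI/offsetJ for this thread ...

    for (int i = -R; i < R + 1; i++) {
        for (int j = -R; j < R + 1; j++) {
            // Calculate and scatter value into the shared-memory tile at [offsetJ + j][offsetI + i]
        }
    }

    // ... write the finished tile back to output in global memory ...
}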
I have generalised the code so that r can be given at runtime, using dynamically allocated shared memory for the tiles. The result is still correct, but execution is between 1.5 and 3 times slower, depending on the value of r. From my tests I have concluded that the slow-down is caused by the loop bounds only being known at runtime, which prevents the compiler from unrolling the loops (as I assume it does when r is a compile-time constant).
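The runtime-r version replaces the statically sized tile with dynamically allocated shared memory whose size is passed at launch, roughly like this (again simplified, with the tile indexed manually since its width is no longer a compile-time constant):

__global__ void filterKernelDyn(const float *input, float *output,
                                int width, int height, int r)
{
    // Tile in dynamically allocated shared memory; its size is set at launch time
    extern __shared__ float tile[];
    const int tileW = TILE_W + 2 * r;   // padded tile width, used for manual 2D indexing

    // ... load/initialise the tile and compute offsetI/offsetJ for this thread ...

    for (int i = -r; i < r + 1; i++) {
        for (int j = -r; j < r + 1; j++) {
            // Calculate and scatter value into the tile at [(offsetJ + j) * tileW + (offsetI + i)]
        }
    }

    // ... write the finished tile back to output in global memory ...
}

// Host side: the required amount of shared memory is supplied as the third launch parameter, e.g.
//   size_t smemBytes = (TILE_W + 2 * r) * (TILE_H + 2 * r) * sizeof(float);
//   filterKernelDyn<<<grid, block, smemBytes>>>(d_in, d_out, width, height, r);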
If anyone has suggestions on how to improve performance in this particular case, or knows of a similar implementation, tips are welcome. My only ideas so far are either to compile a separate kernel for each value of r, or to get rid of the inner loop (though I'm not sure how that would help).
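To make the first idea concrete, I am thinking of something along these lines: make r a template parameter so that each instantiation gets compile-time loop bounds (and thus unrolling), and dispatch on the runtime value on the host (sketch only, covering whatever range of radii is realistic for my images):

template <int R>
__global__ void filterKernelFixed(const float *input, float *output, int width, int height)
{
    // ... same body as the compile-time version above, with r replaced by R ...
}

// Host-side dispatch: pick the instantiation matching the runtime radius
void launchFilter(const float *d_in, float *d_out, int width, int height,
                  int r, dim3 grid, dim3 block)
{
    switch (r) {
        case 1: filterKernelFixed<1><<<grid, block>>>(d_in, d_out, width, height); break;
        case 2: filterKernelFixed<2><<<grid, block>>>(d_in, d_out, width, height); break;
        case 3: filterKernelFixed<3><<<grid, block>>>(d_in, d_out, width, height); break;
        // ... one case per supported radius ...
        default: /* fall back to the runtime-r kernel */ break;
    }
}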