Question: In CUDA, is there a general way of improving the performance of nested for-loops whose bounds are only determined at runtime (and therefore can't be unrolled by the compiler)?
Background: I am working on a CUDA implementation of a 2D image filter. For each pixel of the input, the corresponding output value is calculated from the (2*r+1) * (2*r+1) neighbouring pixels. Although the radius r is constant for a given image, the shape of the filter depends on the value at each pixel, so the filter can't be expressed as a true convolution or decomposed into two 1D passes.
I have a fairly efficient implementation for the case where the filter radius r is known at compile time, based on a scatter approach (which is faster than any gather approach I could come up with) in which each input pixel is assigned its own thread. The output is divided into tiles that are kept in shared memory. At the heart of the algorithm is a nested for-loop executed by each thread:
for (int i = -r; i < r + 1; i++) {
    for (int j = -r; j < r + 1; j++) {
        // Calculate and scatter value to output[offsetJ + j][offsetI + i]
    }
}
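For context, a stripped-down skeleton of the compile-time version looks roughly like this (the tile bookkeeping and the actual filter calculation are omitted, and the tile dimensions are just illustrative):

#define R      4    // filter radius, fixed at compile time in this version
#define TILE_W 16   // illustrative tile dimensions
#define TILE_H 16

__global__ void filterKernel(const float *input, float *output, int width, int height)
{
    // Output tile kept in shared memory, padded by the filter radius on each side
    __shared__ float tile[TILE_H + 2 * R][TILE_W + 2 * R];

    // ... load/initialise the tile and compute offsetI/offsetJ for this thread ...

    for (int i = -R; i < R + 1; i++) {
        for (int j = -R; j < R + 1; j++) {
            // Calculate and scatter value into the shared-memory tile at [offsetJ + j][offsetI + i]
        }
    }

    // ... write the finished tile back to output in global memory ...
}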
I have generalised the code so that r can be given at runtime, using dynamically allocated shared memory for the tiles. The result is still correct, but execution is between 1.5 and 3 times slower, depending on the value of r. From my tests I have concluded that the slow-down is caused by the loop bounds only being known at runtime, which prevents the compiler from unrolling the loops (as I assume it does when r is a compile-time constant).
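The runtime-r version replaces the statically sized tile with dynamically allocated shared memory whose size is passed at launch, roughly like this (again simplified, with the tile indexed manually since its width is no longer a compile-time constant):

__global__ void filterKernelDyn(const float *input, float *output,
                                int width, int height, int r)
{
    // Tile in dynamically allocated shared memory; its size is set at launch time
    extern __shared__ float tile[];
    const int tileW = TILE_W + 2 * r;   // padded tile width, used for manual 2D indexing

    // ... load/initialise the tile and compute offsetI/offsetJ for this thread ...

    for (int i = -r; i < r + 1; i++) {
        for (int j = -r; j < r + 1; j++) {
            // Calculate and scatter value into the tile at [(offsetJ + j) * tileW + (offsetI + i)]
        }
    }

    // ... write the finished tile back to output in global memory ...
}

// Host side: the required amount of shared memory is supplied as the third launch parameter, e.g.
//   size_t smemBytes = (TILE_W + 2 * r) * (TILE_H + 2 * r) * sizeof(float);
//   filterKernelDyn<<<grid, block, smemBytes>>>(d_in, d_out, width, height, r);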
If anyone has suggestions on how to improve performance in this particular case, or knows of a similar implementation, tips are welcome. My only ideas so far are either to compile a separate kernel for each value of r, or to get rid of the inner loop (though I'm not sure how that would help).
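To make the first idea concrete, I am thinking of something along these lines: make r a template parameter so that each instantiation gets compile-time loop bounds (and thus unrolling), and dispatch on the runtime value on the host (sketch only, covering whatever range of radii is realistic for my images):

template <int R>
__global__ void filterKernelFixed(const float *input, float *output, int width, int height)
{
    // ... same body as the compile-time version above, with r replaced by R ...
}

// Host-side dispatch: pick the instantiation matching the runtime radius
void launchFilter(const float *d_in, float *d_out, int width, int height,
                  int r, dim3 grid, dim3 block)
{
    switch (r) {
        case 1: filterKernelFixed<1><<<grid, block>>>(d_in, d_out, width, height); break;
        case 2: filterKernelFixed<2><<<grid, block>>>(d_in, d_out, width, height); break;
        case 3: filterKernelFixed<3><<<grid, block>>>(d_in, d_out, width, height); break;
        // ... one case per supported radius ...
        default: /* fall back to the runtime-r kernel */ break;
    }
}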