Question: In CUDA, is there a general way of improving the performance of nested for-loops whose conditions are determined at runtime (and therefore can't be unrolled by the compiler)?
Background: I am working on a CUDA implementation of a 2D image filter algorithm. For each pixel of the input, the value of the output is calculated by looking at the (2*r+1) * (2*r+1) neighbouring pixels. Although r is constant for each image, the shape of the filter is dependent on the value at each pixel, and hence it can't be converted into a true convolution operation or decomposed into two 1D operations.
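To make the operation concrete, it has (in gather form) roughly the following shape. This is only an illustration, not my actual kernel: weightPlaceholder and filterPixel are made-up names standing in for the value-dependent part.

// Illustrative gather form of the filter: each output pixel is a weighted sum
// over its (2r+1) x (2r+1) neighbourhood, where the weight depends on the
// centre pixel's value, so it is neither a fixed convolution nor separable.
__device__ float weightPlaceholder(float centre, float neighbour, int di, int dj)
{
    // Stand-in for the real value-dependent weighting.
    return (fabsf(centre - neighbour) < 0.1f) ? 1.0f : 0.0f;
}

__device__ float filterPixel(const float *in, int width, int height,
                             int x, int y, int r)
{
    float centre = in[y * width + x];
    float acc = 0.0f;
    for (int i = -r; i <= r; i++) {
        for (int j = -r; j <= r; j++) {
            int yy = min(max(y + i, 0), height - 1);   // clamp at image borders
            int xx = min(max(x + j, 0), width - 1);
            float neighbour = in[yy * width + xx];
            acc += weightPlaceholder(centre, neighbour, i, j) * neighbour;
        }
    }
    return acc;
}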
I have a fairly efficient implementation for when the filter radius r is known at compile time, based on a scatter approach (which is faster than any gather approach I could come up with) where each pixel in the input is assigned a thread. The output is divided into tiles that are kept in shared memory. At the heart of the algorithm is a nested for-loop executed by each thread:
for (int i = -r; i < r + 1; i++) {
    for (int j = -r; j < r + 1; j++) {
        // Calculate and scatter value to output[offsetJ + j][offsetI + i]
    }
}
I have generalised the code for r given at runtime by using dynamically allocated shared memory. The result is still correct, but execution is between 1.5 and 3 times slower, depending on the value of r. Through testing I have concluded that the slow-down comes from the loop bounds being determined at runtime, which prevents the compiler from unrolling the loops (which I assume it does when r is known at compile time).
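For reference, the structure I mean by "dynamically allocated shared memory" is roughly the following (a simplified sketch: TILE_DIM, the tile layout and the function names are illustrative, and the kernel body is elided). The tile is declared extern __shared__ and its size is passed as the third launch-configuration argument.

#define TILE_DIM 16   // illustrative tile size

__global__ void filterKernel(const float *in, float *out,
                             int width, int height, int r)
{
    // Tile sized at launch time rather than compile time:
    // (TILE_DIM + 2*r) * (TILE_DIM + 2*r) floats.
    extern __shared__ float tile[];

    // ... load input into the tile, __syncthreads(), then run the nested
    // [-r, r] x [-r, r] scatter loop shown above ...
}

void launchFilter(const float *d_in, float *d_out, int width, int height, int r)
{
    dim3 block(TILE_DIM, TILE_DIM);
    dim3 grid((width + TILE_DIM - 1) / TILE_DIM,
              (height + TILE_DIM - 1) / TILE_DIM);
    size_t smemBytes = (size_t)(TILE_DIM + 2 * r) * (TILE_DIM + 2 * r) * sizeof(float);
    filterKernel<<<grid, block, smemBytes>>>(d_in, d_out, width, height, r);
}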
If anyone has suggestions on how to improve the performance in this particular case, or knows of a similar implementation, tips are welcome. My only ideas so far are either to compile a separate kernel for each value of r (sketched below), or to get rid of the inner loop (though I'm not sure how that would help).
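To illustrate the first idea (a sketch with made-up names; the loop body would be the same as above): make r a template parameter so each instantiation has compile-time loop bounds that nvcc can unroll, and pick the instantiation with a host-side switch.

template <int R>
__global__ void filterKernelFixed(const float *in, float *out,
                                  int width, int height)
{
    // R is a compile-time constant here, so both loops can be unrolled.
    #pragma unroll
    for (int i = -R; i < R + 1; i++) {
        #pragma unroll
        for (int j = -R; j < R + 1; j++) {
            // Calculate and scatter value, as in the loop above
        }
    }
}

// Host-side dispatch: one instantiation per supported radius.
void launchFixed(const float *d_in, float *d_out, int width, int height,
                 int r, dim3 grid, dim3 block)
{
    switch (r) {
    case 1: filterKernelFixed<1><<<grid, block>>>(d_in, d_out, width, height); break;
    case 2: filterKernelFixed<2><<<grid, block>>>(d_in, d_out, width, height); break;
    case 3: filterKernelFixed<3><<<grid, block>>>(d_in, d_out, width, height); break;
    // ... and so on for each radius that needs to be supported
    default: break;   // would fall back to the runtime-r kernel
    }
}

The obvious downside is that every radius I want to support has to be instantiated ahead of time, which increases compile time and binary size.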