performance - Tips for improving performance of a 2d image 'tracing' CUDA kernel?

Question

Can you give me some tips to optimize this CUDA code?

I'm running this on a device with compute capability 1.3 (I need it for a Tesla C1060 although I'm testing it now on a GTX 260 which has the same compute capability) and I have several kernels like the one below. The number of threads I need to execute this kernel is given by long SUM and depends on size_t M and size_t N which are the dimensions of a rectangular image received as parameter it can vary greatly from 50x50 to 10000x10000 in pixels or more. Although I'm mostly interested in working the bigger images with Cuda.

Now each image has to be traced in all directions and angles and some computations must be done over the values extracted from the tracing. So, for example, for a 500x500 image I need 229080 threads computing that kernel below which is the value of SUM (that's why I check that the thread id idHilo doesn't go over it). I copied several arrays into the global memory of the device one after another since I need to access them for the calculations all of length SUM. Like this

cudaMemcpy(xb_cuda,xb_host,(SUM*sizeof(long)),cudaMemcpyHostToDevice);

cudaMemcpy(yb_cuda,yb_host,(SUM*sizeof(long)),cudaMemcpyHostToDevice);

...etc

So each value of every array can be accessed by one thread. All are done before the kernel calls. According to the Cuda Profiler on Nsight the highest memcopy duration is 246.016 us for a 500x500 image so that is not taking so long.

But the kernels like the one I copied below are taking too long for any practical use (3.25 seconds according to the Cuda profiler for the kernel below for a 500x500 image and 5.052 seconds for the kernel with the highest duration) so I need to see if I can optimize them.

I arrange the data this way

First the block dimension

dim3 dimBlock(256,1,1);

then the number of blocks per Grid

dim3 dimGrid((SUM+255)/256);

For a number of 895 blocks for a 500x500 image.

I'm not sure how to use coalescing and shared memory in my case or even if it's a good idea to call the kernel several times with different portions of the data. The data is independent one from the other so I could in theory call that kernel several times and not with the 229080 threads all at once if needs be.

Now take into account that the outer for loop

for(t=15;t<=tendbegin_cuda[idHilo]-15;t++){

depends on

tendbegin_cuda[idHilo]

the value of which depends on each thread but most threads have similar values for it.

According to the Cuda Profiler the Global Store Efficiency is of 0.619 and the Global Load Efficiency is 0.951 for this kernel. Other kernels have similar values .

Is that good? bad? how can I interpret those values? Sadly the devices of compute capability 1.3 don't provide other useful info for assessing the code like the Multiprocessor and Kernel Memory or Instruction analysis. The only results I get after the analysis is "Low Global Memory Store Efficiency" and "Low Global Memory Load Efficiency" but I'm not sure how I can optimize those.

void __global__ t21_trazo(long SUM,int cT, double Bn, size_t M, size_t N, float* imagen_cuda, double* vector_trazo_cuda, long* xb_cuda, long* yb_cuda, long* xinc_cuda, long* yinc_cuda, long* tbegin_cuda, long* tendbegin_cuda){

long xi;
long yi;
int t;
int k;
int a;
int ji;
long idHilo=blockIdx.x*blockDim.x+threadIdx.x;

int neighborhood[31];
int v=0;

if(idHilo<SUM){

    for(t=15;t<=tendbegin_cuda[idHilo]-15;t++){

        xi = xb_cuda[idHilo] + floor((double)t*xinc_cuda[idHilo]);
        yi = yb_cuda[idHilo] + floor((double)t*yinc_cuda[idHilo]);
        neighborhood[v]=floor(xi/Bn);
        ji=floor(yi/Bn);

        if(fabs((double)neighborhood[v]) < M && fabs((double)ji)<N)
        {
            if(tendbegin_cuda[idHilo]>30 && v==30){

                if(t==0)
                vector_trazo_cuda[20+idHilo*31]=0;

                for(k=1;k<=15;k++)
                vector_trazo_cuda[20+idHilo*31]=vector_trazo_cuda[20+idHilo*31]+fabs(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
                            imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);


                for(a=0;a<30;a++)
                neighborhood[a]=neighborhood[a+1];

                v=v-1;
            }

            v=v+1;
        }
    }
}

}

EDIT:

Changing the DP flops for SP flops only slightly improved the duration. Loop unrolling the inner loops practically didn't help.

score 1 · Accepted Answer

对于非结构化的答案，我只是要抛出一些通常有用的注释，并引用您的代码，以使其对其他人更有用。

算法更改始终是优化的第一要务。是否有另一种方法来解决需要较少数学/迭代/内存等的问题？

如果精度不是一个大问题，请使用浮点（或具有较新架构的半精度浮点）。当您短暂尝试时它不会对您的性能产生太大影响的部分原因是因为您仍在对浮点数据使用双精度计算（fabs 需要双精度，因此如果您使用浮点数，它会将您的浮点数转换为双精度, 做双重数学，返回一个双精度并转换为浮点数，使用 fabsf)。

如果您不需要使用 float 的绝对完整精度，请使用快速数学（编译器选项）。

乘法比除法快得多（特别是对于全精度/非快速数学）。在内核外部计算 1/var，然后在内核内部进行乘法运算而不是除法运算。

不知道它是否得到优化，但您应该使用增量和减量运算符。v=v-1；可以是 v--；等等

转换为 int 将截断为零。floor() 将向负无限截断。您可能不需要明确的 floor()，也不需要像上面那样用于 float 的 floorf()。当您将它用于整数类型的中间计算时，它们已经被截断。因此，您无缘无故地转换为双倍并返回。使用适当类型的函数（abs、fabs、fabsf 等）

if(fabs((double)neighborhood[v]) < M && fabs((double)ji)<N)
change to
if(abs(neighborhood[v]) < M && abs(ji)<N)

vector_trazo_cuda[20+idHilo*31]=vector_trazo_cuda[20+idHilo*31]+
    fabs(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
        imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);
change to 
vector_trazo_cuda[20+idHilo*31] +=
    fabsf(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
        imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);

.

xi = xb_cuda[idHilo] + floor((double)t*xinc_cuda[idHilo]);
change to
xi = xb_cuda[idHilo] + t*xinc_cuda[idHilo];

上面的行是不必要的复杂。本质上，您正在执行此操作，将 t 转换为 double，将 xinc_cuda 转换为 double 并乘以，将其取整（返回 double），将 xb_cuda 转换为 double 并添加，转换为 long。

新行将在更短的时间内存储相同的结果（也更好，因为如果您在前一种情况下超过 double 的精度，您将四舍五入到最接近的 2 次方）。 此外，这四行应该在 for 循环之外……如果它们不依赖于 t，则不需要重新计算它们。如果这将您的运行时间缩短 10-30 倍，我不会感到惊讶。

您的结构会导致大量全局内存读取，尝试从全局读取一次，处理本地内存上的计算，然后写入一次全局（如果可能的话）。

始终使用 -lineinfo 进行编译。使分析更容易，而且我无法评估任何开销（使用 0.1 到 10ms 执行时间范围内的内核）。

如果您受计算或内存限制，请使用分析器确定并相应地投入时间。

尽可能让编译器使用寄存器，这是一个很大的话题。

与往常一样，不要一次更改所有内容。我通过编译/测试输入了所有这些，所以我可能有一个错误。

score 0 · Accepted Answer

您可能同时运行了太多线程。当您运行正确数量的线程时，最佳性能似乎出现了：有足够多的线程保持忙碌，但又不会过多地分割每个同时线程可用的本地内存。

去年秋天，我构建了一个教程来研究使用 CUDA 和 CUDAFY 优化旅行商问题 (TSP)。即使问题领域不同，我在从已发布的算法中实现数倍加速所经历的步骤可能对指导您的努力很有用。教程和代码可在CUDA Tuning with CUDAFY 获得。

performance - Tips for improving performance of a 2d image 'tracing' CUDA kernel?

2 回答 2

Related

Reference