c++ - CUDA parallel computing speed up volume calculation

Question

Device : Tesla C2050
OS : Windows 7 Enterprise
IDE : VS 2010
CUDA : 5.0 (newest)

First time to ask question here. I met some problems in my CUDA program.

I have millions tetrahedrons with one point at (0,0,0), so I can use the formula:

to get the volume of the tetrahedrons.

So , here is the code:

struct Triangle
{
    double x1;
    double y1;
    double z1;
    double x2;
    double y2;
    double z2;
    double x3;
    double y3;
    double z3;
};

And the CUDA code:

__global__ void getResult(double *d_volume ,Triangle *d_triangles, Origin *d_point)
{
    extern __shared__ Triangle s_data[];
    int tid = threadIdx.x;
    int i =  blockDim.x * blockIdx.x + threadIdx.x;
    s_data[tid] = d_triangles[i];
    __syncthreads();
    d_volume[i] =s_data[tid].x1 * s_data[tid].y2 * s_data[tid].z3 + \
                s_data[tid].y1 * s_data[tid].z2 * s_data[tid].x3 + \
                s_data[tid].x2 * s_data[tid].y3 * s_data[tid].z1 - \
                s_data[tid].x3 * s_data[tid].y2 * s_data[tid].z1 - \
                s_data[tid].x2 * s_data[tid].y1 * s_data[tid].z3 - \
                s_data[tid].y3 * s_data[tid].z2 * s_data[tid].x1;
}

I got millions of tetrahedrons from other function as an array.

// Host
Triangle *h_triangles = triangles;
double *h_volume;
// Device
Triangle *d_triangles;
double *d_volume;

// define grid and block size
int numThreadsPerBlock = numTriangles;
int numBlocks = numTrianges / 512;

// Shard memory size
int sharedMemSize = numThreadsPerBlock * sizeof(Triangle);

// allocate host and device memory
size_t memSize_triangles = numBlocks * numThreadsPerBlock * sizeof(Triangle);
size_t memSize_volume = numBlocks * numThreadsPerBlock * sizeof(double);

cudaMalloc( (void **) &d_triangles, memSize_triangles );
cudaMalloc( (void **) &d_volume, memSize_volume );

// Copy host array to device array
cudaMemcpy( d_triangles, h_triangles, memSize_triangles, cudaMemcpyHostToDevice );
cudaMemcpy( d_point, h_point, memSize_point, cudaMemcpyHostToDevice );

// launch kernel
dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);

getResult<<< dimGrid, dimBlock, sharedMemSize >>>( d_volume, d_triangles);

// block until the device has completed
cudaThreadSynchronize();

// device to host copy
cudaMemcpy( h_volume, d_volume, memSize_volume, cudaMemcpyDeviceToHost );

// free device memory
cudaFree(d_triangles);
cudaFree(d_volume);

// free host memory
free(h_triangles); 
free(h_volume);

Up to now, everything works OK. But I cost more time than I thought to get the volume. My device is Tesla C2050(515Gflops), 20 times faster than my CPU(single-core, 20.25Gflops). But only speed up about 10 times (not including the time copying memory between device and host.)

I'd like to know how can I make it about 20 times faster than the CPU code (for loop to get the volume.).

Thanks!

PS: Maybe cudaMallocPitch() will help me, but the triangles are not matrix, I can't use cudaMemcpy2D() to copy memory instead of cudaMemcpy(). Anyone who can help me about this question?

score 1 · Accepted Answer

The peak performance on GPU usually harder to get compared to CPU. One of the reason is that a lot of kernels are bandwidth-bound rather than computing-bound.

Since your kernel's computing complexity is O(n). You probably should use bandwidth metric to calculate the theoretical peak performance as follows

1024*1024*64 * sizeof(double) * (9  +   1)     / (144e9    *    8/9)     = 42 ms
#tetrahedron                     #input #output   peak mem bw   ECC cost

On the other hand, your kernel could be further optimized.

Choose blockDim/gridDim carefully, wrong numbers sometimes result in 20% performance lost.
Instead of computing one volume per thread, you could computing multiple volumes per thread, which will reduce the thread launching overhead.
Since you don't share data between threads, __syncthreads() may be able to eliminated.
Array of Structures (AoS) often slower than Structure of Arrays (SoA) on GPU due to non-coalesced mem access. you could also try to change your data structure.

Update

Got a new kernel with large L1 cache setup and best blockDim/gridDim choice. It's 15% faster. Here's the code and profile result. My device is M2090.

profile result

#include <stdlib.h>
#include <thrust/transform.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <iterator>
#include <thrust/inner_product.h>

using namespace thrust::placeholders;

struct Triangle
{
    double x1;
    double y1;
    double z1;
    double x2;
    double y2;
    double z2;
    double x3;
    double y3;
    double z3;
};

__global__ void getResultNoSMem(double *d_volume, Triangle *d_triangles)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    d_volume[i] = d_triangles[i].x1 * d_triangles[i].y2 * d_triangles[i].z3 +
            d_triangles[i].y1 * d_triangles[i].z2 * d_triangles[i].x3 +
            d_triangles[i].x2 * d_triangles[i].y3 * d_triangles[i].z1 -
            d_triangles[i].x3 * d_triangles[i].y2 * d_triangles[i].z1 -
            d_triangles[i].x2 * d_triangles[i].y1 * d_triangles[i].z3 -
            d_triangles[i].y3 * d_triangles[i].z2 * d_triangles[i].x1;
}

__global__ void getResult(double *d_volume, Triangle *d_triangles)
{
    extern __shared__ Triangle s_data[];
    int tid = threadIdx.x;
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    s_data[tid] = d_triangles[i];
//  __syncthreads();
    d_volume[i] = s_data[tid].x1 * s_data[tid].y2 * s_data[tid].z3 +
            s_data[tid].y1 * s_data[tid].z2 * s_data[tid].x3 +
            s_data[tid].x2 * s_data[tid].y3 * s_data[tid].z1 -
            s_data[tid].x3 * s_data[tid].y2 * s_data[tid].z1 -
            s_data[tid].x2 * s_data[tid].y1 * s_data[tid].z3 -
            s_data[tid].y3 * s_data[tid].z2 * s_data[tid].x1;
}

__global__ void getResultOpt(double *d_volume, Triangle *d_triangles, int len)
{
    const int gridSize = blockDim.x * gridDim.x;
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    while (i < len)
    {
        d_volume[i] = d_triangles[i].x1 * d_triangles[i].y2 * d_triangles[i].z3 +
                d_triangles[i].y1 * d_triangles[i].z2 * d_triangles[i].x3 +
                d_triangles[i].x2 * d_triangles[i].y3 * d_triangles[i].z1 -
                d_triangles[i].x3 * d_triangles[i].y2 * d_triangles[i].z1 -
                d_triangles[i].x2 * d_triangles[i].y1 * d_triangles[i].z3 -
                d_triangles[i].y3 * d_triangles[i].z2 * d_triangles[i].x1;
        i += gridSize;
    }
}

int main(void)
{
    const int m = 1024 * 1024;
    thrust::host_vector<Triangle> data(m);
    for (int i = 0; i < m; i++)
    {
        data[i].x1 = (double) rand() / RAND_MAX;
        data[i].y1 = (double) rand() / RAND_MAX;
        data[i].z1 = (double) rand() / RAND_MAX;
        data[i].x2 = (double) rand() / RAND_MAX;
        data[i].y2 = (double) rand() / RAND_MAX;
        data[i].z2 = (double) rand() / RAND_MAX;
        data[i].x3 = (double) rand() / RAND_MAX;
        data[i].y3 = (double) rand() / RAND_MAX;
        data[i].z3 = (double) rand() / RAND_MAX;
    }

    thrust::device_vector<Triangle> triangles = data;
    thrust::device_vector<double> volume(m);
    thrust::device_vector<double> volumeOpt(m);

    Triangle* dTriangles = thrust::raw_pointer_cast(&triangles[0]);
    double* dVolume = thrust::raw_pointer_cast(&volume[0]);
    double* dVolumeOpt = thrust::raw_pointer_cast(&volumeOpt[0]);

    int g;
    int b;

    int threadUpperLimit = 48 * 1024 / sizeof(Triangle);

    //for (b = 32; b <= 1024; b += 32)
    {
        b = 64;
        int gridDim = (m + b - 1) / b;
        getResultNoSMem<<<gridDim, b, 0, 0>>>(dVolume, dTriangles);
    }

    //  for (b = 32; b <= threadUpperLimit; b += 32)
    {
        b = 64;
        int gridDim = (m + b - 1) / b;
        getResult<<<gridDim, b, b * sizeof(Triangle), 0>>>(dVolume, dTriangles);
    }

    //for (g = 32; g <= 512; g += 32)
    //  for (b = 32; b <= 1024; b += 32)
    {
        b = 64;
        g = 64;
        getResultOpt<<<g, b, 0, 0>>>(dVolumeOpt, dTriangles, m);
    }

    //for (g = 32; g <= 512; g += 32)
    //  for (b = 32; b <= 1024; b += 32)
    {
        b = 64;
        g = 512;
        cudaFuncSetCacheConfig(getResultOpt, cudaFuncCachePreferL1);
        getResultOpt<<<g, b, 0, 0>>>(dVolumeOpt, dTriangles, m);
    }

    thrust::device_vector<double> X = volume;
    thrust::device_vector<double> Y = volumeOpt;
    thrust::transform(X.begin(), X.end(), Y.begin(), X.begin(), _1 - _2);
    double result = thrust::inner_product(X.begin(), X.end(), X.begin(), 0.0);

    std::cout << "difference: " << result << std::endl;

    return 0;
}

score 1 · Accepted Answer

As Eric's answer implies, your kernel requires nine 64 bit loads and a 64 bit store per thread, but each thread only performs 17 FLOP. That probably means your code is memory bandwidth limited, rather than compute limited, and you shouldn't expect to be able to hit peak FLOP/s throughput for a code of this type.

The key to optimal performance will, therefore, most likely be in memory bandwidth optimization. At the moment, your kernel has several obvious problems, one of which I touched on in a comment. You really don't need shared memory in the kernel as written - it is slower than registers and there is no memory bandwidth improvements to be had by using it. The use of __syncthreads() also adds unnecessary latency to the kernel. You code could be written as simply as this:

__global__ void getresult2(double *d_volume, Triangle *d_triangles)
{
    int i =  blockDim.x * blockIdx.x + threadIdx.x;
    Triangle t = d_triangles[i];
    d_volume[i] = t.x1 * t.y2 * t.z3 + 
                  t.y1 * t.z2 * t.x3 +
                  t.x2 * t.y3 * t.z1 -
                  t.x3 * t.y2 * t.z1 -
                  t.x2 * t.y1 * t.z3 - 
                  t.y3 * t.z2 * t.x1;
}

[disclaimer, never compiled or run, use at own risk]

and I would expect it to perform better than the shared memory version.

The second problem is one of memory coalescing. The structure you have is rather large, and each thread loading a complete instance of Triangle won't be very friendly to memory coalescing or cache reuse. You could try using shared memory to improve the memory load performance and coalesce the writes by doing something like this:

__global__ 
void __launch_bounds__(288, 3)getresult3(double *d_volume, Triangle *d_triangles)
{
    __shared__ double s_data[9*32];

    int i =  blockDim.x * blockIdx.x + threadIdx.x;
    double * t_data = reinterpret_cast<double *>(d_triangles);
    s_data[threadIdx.x] = t_data[i];
    __syncthreads();

    if (threadIdx.x < 32) {
        Triangle * t = reinterpret_cast<Triangle *>(&s_data[9*threadIdx.x]);
        d_volume[i] = t->x1 * t->y2 * t->z3 + 
            t->y1 * t->z2 * t->x3 +
            t->x2 * t->y3 * t->z1 -
            t->x3 * t->y2 * t->z1 -
            t->x2 * t->y1 * t->z3 - 
            t->y3 * t->z2 * t->x1;
    }
}

[disclaimer, never compiled or run, use at own risk]

Here, the block of 288 threads fetches 32 Triangle instances to shared memory in a coalesced load, then the first 32 threads in the block perform the calculations and store 32 results. A scheme like this might prove faster if the kernel is really not achieving a very high fraction of global memory bandwdith throughput. The CUDA toolkit profiling tools have some rather useful performance metric analysis which can help pinpoint the performance bottlenecks in your code. Like all optimization exercises, the key is careful profiling and benchmarking, which is something only you are in a position to do.

c++ - CUDA parallel computing speed up volume calculation

2 回答 2

Update

Related

Reference