
I'm having some trouble handling large matrices. As explained in another question, I have a program that works on large square matrices (in the 5k-10k range). The computational part is correct (though still not 100% optimized), and I've tested it with smaller square matrices (around 256-512). Here is my code:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N 10000
#define RADIUS 100
#define SQRADIUS (RADIUS * RADIUS)
#define THREADS 512

//many of these device functions are declared
__device__ unsigned char avg(const unsigned char *src, const unsigned int row, const unsigned int col) {
    unsigned int sum = 0, c = 0;

    //some work with radius and stuff

    return sum;
}

__global__ void applyAvg(const unsigned char *src, unsigned char *dest) {
    unsigned int tid = blockDim.x * blockIdx.x + threadIdx.x, tmp = 0;
    unsigned int stride = blockDim.x * gridDim.x;
    int col = tid%N, row = (int)tid/N;

    while(tid < N*N) {
        if(row * col < N * N) {
            //choose which of the __device__ functions needs to be launched
        }

        tid += stride;
        col = tid%N, row = (int)tid/N;
    }
    __syncthreads();
}

int main( void ) {
    cudaError_t err;
    unsigned char *base, *thresh, *d_base, *d_thresh, *avg, *d_avg;
    int i, j;

    base = (unsigned char*)malloc((N * N) * sizeof(unsigned char));
    thresh = (unsigned char*)malloc((N * N) * sizeof(unsigned char));
    avg = (unsigned char*)malloc((N * N) * sizeof(unsigned char));

    err = cudaMalloc((void**)&d_base, (N * N) * sizeof(unsigned char));
    if(err != cudaSuccess) {printf("ERROR 1"); exit(-1);}
    err = cudaMalloc((void**)&d_thresh, (N * N) * sizeof(unsigned char));
    if(err != cudaSuccess) {printf("ERROR 2"); exit(-1);}
    err = cudaMalloc((void**)&d_avg, (N * N) * sizeof(unsigned char));
    if(err != cudaSuccess) {printf("ERROR 3"); exit(-1);}

    for(i = 0; i < N * N; i++) {
        base[i] = (unsigned char)(rand() % 256);
    }

    err = cudaMemcpy(d_base, base, (N * N) * sizeof(unsigned char), cudaMemcpyHostToDevice);
    if(err != cudaSuccess){printf("ERROR 4"); exit(-1);}

    //more 'light' stuff to do before the 'heavy computation'

    applyAvg<<<(N + THREADS - 1) / THREADS, THREADS>>>(d_thresh, d_avg);

    err = cudaMemcpy(thresh, d_thresh, (N * N) * sizeof(unsigned char), cudaMemcpyDeviceToHost);
    if(err != cudaSuccess) {printf("ERROR 5"); exit(-1);}
    err = cudaMemcpy(avg, d_avg, (N * N) * sizeof(unsigned char), cudaMemcpyDeviceToHost);
    if(err != cudaSuccess) {printf("ERROR 6"); exit(-1);}

    getchar();
    return 0;
}

When I launch the program on a large matrix (e.g. 10000 x 10000) with a radius of 100 (the "distance" I look ahead from each point of the matrix), it takes an enormous amount of time.

I believe the problem lies both in the launch configuration applyAvg<<<(N + THREADS - 1) / THREADS, THREADS>>> (how I decide how many blocks and threads to run) and in the applyAvg(...) kernel itself (the stride and tid). Given that the matrix size can vary from 5k to 10k, can someone clarify the best way to decide how many blocks/threads to launch?


1 Answer


I suppose what you want to do is image filtering/convolution. Based on your current CUDA kernel, there are two things you can do to improve performance:

  1. Use 2-D threads/blocks to avoid the / and % operators; they are slow (see the sketch after this list).

  2. Use shared memory to reduce global memory bandwidth (a rough sketch follows the white-paper link below).
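For point 1, here is a minimal sketch of what a 2-D launch could look like for the same work; the kernel name applyAvg2D and the 16x16 block shape are illustrative choices, not part of your original code:

// Illustrative 2-D version: one thread per pixel, no / or % per element.
__global__ void applyAvg2D(const unsigned char *src, unsigned char *dest) {
    unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;

    if(row < N && col < N) {
        //choose which of the __device__ functions needs to be launched,
        //e.g. dest[row * N + col] = avg(src, row, col);
    }
}

// Host-side launch: the grid is derived from N, so it covers all N*N pixels
// for any matrix size between 5k and 10k.
dim3 block(16, 16);
dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
applyAvg2D<<<grid, block>>>(d_thresh, d_avg);

This also answers the block/thread-count question: with one thread per pixel, the grid-stride loop and the per-iteration tid % N and tid / N disappear.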

Here is a white paper about image convolution. It shows how to implement a high-performance box filter with CUDA:

http://docs.nvidia.com/cuda/samples/3_Imaging/convolutionSeparable/doc/convolutionSeparable.pdf
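As a very rough illustration of point 2 (the white paper does this properly), the horizontal pass of a separable box filter can stage one row tile plus its halo in shared memory. TILE, the kernel name boxRowPass, and the zero padding at the image borders are assumptions of this sketch; with RADIUS = 100 the halo is wide, so the paper's approach of processing several output pixels per thread is the more realistic layout:

#define TILE 256   // must equal blockDim.x in this sketch

__global__ void boxRowPass(const unsigned char *src, unsigned char *dest) {
    __shared__ unsigned char tile[TILE + 2 * RADIUS];

    int row = blockIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;

    // Stage the tile plus its left/right halo in shared memory (zero-padded).
    for(int i = threadIdx.x; i < TILE + 2 * RADIUS; i += blockDim.x) {
        int c = blockIdx.x * TILE + i - RADIUS;
        tile[i] = (c >= 0 && c < N) ? src[row * N + c] : 0;
    }
    __syncthreads();

    if(col < N) {
        unsigned int sum = 0;
        for(int k = -RADIUS; k <= RADIUS; k++)
            sum += tile[threadIdx.x + RADIUS + k];
        dest[row * N + col] = (unsigned char)(sum / (2 * RADIUS + 1));
    }
}

It would be launched with dim3 grid((N + TILE - 1) / TILE, N) and TILE threads per block; a second, vertical pass over the result completes the 2-D box filter, and each global element is then read roughly once per pass instead of once per neighbour.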

The NVIDIA NPP library also provides the box-filter function nppiFilterBox(), so you don't need to write your own kernel. Here are the documentation and a sample:

http://docs.nvidia.com/cuda/cuda-samples/index.html#box-filter-with-npp

NPP documentation, p. 1009: http://docs.nvidia.com/cuda/pdf/NPP_Library.pdf
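For completeness, a hedged sketch of what the NPP call might look like on the device buffers from your code; nppiFilterBox_8u_C1R is the 8-bit single-channel variant, and the ROI here is shrunk by RADIUS on every side so the 201x201 mask never reads outside the image (check the manual for the exact border behaviour of your NPP version):

#include <stdio.h>
#include <npp.h>

void boxFilterWithNpp(const Npp8u *d_src, Npp8u *d_dst) {
    int mask = 2 * RADIUS + 1;                       // 201 x 201 box
    NppiSize  roi    = { N - 2 * RADIUS, N - 2 * RADIUS };
    NppiSize  size   = { mask, mask };
    NppiPoint anchor = { RADIUS, RADIUS };           // centre the mask

    // Offset src/dst so the ROI starts RADIUS pixels inside the image;
    // the step (pitch) is N bytes for a tightly packed 8-bit image.
    NppStatus st = nppiFilterBox_8u_C1R(
        d_src + RADIUS * N + RADIUS, N,
        d_dst + RADIUS * N + RADIUS, N,
        roi, size, anchor);
    if(st != NPP_SUCCESS) printf("NPP error %d\n", st);
}

Link against the NPP image-processing library when building (e.g. -lnppi on the CUDA toolkits of that era).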

Answered 2013-10-30T03:39:12.480