parallel-processing - Rewriting a simple C++ code snippet as CUDA code

Question

I wrote the following simple C++ code.

#include <iostream>
#include <omp.h>

int main()
{
    int myNumber = 0;
    int numOfHits = 0;

    std::cout << "Enter my Number Value" << std::endl;
    std::cin >> myNumber;

    // parallelize the outermost loop; each thread keeps a private count that is summed at the end
    #pragma omp parallel for reduction(+:numOfHits)
    for(int i = 0; i <= 100000; ++i)
    {
        for(int j = 0; j <= 100000; ++j)
        {
            for(int k = 0; k <= 100000; ++k)
            {
                if(i + j + k == myNumber)
                    numOfHits++;
            }
        }
    }

    std::cout << "Number of Hits: " << numOfHits << std::endl;

    return 0;
}

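As you can see, I use OpenMP to parallelize the outermost loop. What I would like to do is rewrite this small program in CUDA. Any help would be much appreciated.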

score 1 · Accepted Answer

Well, I can give you a quick tutorial, but I won't necessarily write it all for you.

So first of all, you will want to get MS Visual Studio set up with CUDA, which is easy if you follow this guide: http://www.ademiller.com/blogs/tech/2011/05/visual-studio-2010-and-cuda-easier-with-rc2/

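Now you will want to read the NVIDIA CUDA Programming Guide (free pdf), the documentation, and CUDA by Example (a book I highly recommend for learning CUDA).

But let's assume you haven't done that yet, and definitely will later.

This is an arithmetic-heavy computation that touches very little data. In fact, it can be computed fairly simply without this brute-force approach, but that isn't the answer you are looking for. I suggest something like this for the kernel: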

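//note: 64-bit counters are used because the total can exceed the range of a 32-bit int for some inputs
__global__ void kernel(const int* myNumber, unsigned long long* numOfHits){

    //a shared value will be stored on-chip, which is beneficial since this is written to multiple times
    //it is shared by all threads in the same block
    //shared variables cannot have an initializer, so one thread zeroes it
    __shared__ unsigned long long s_hits;
    if(threadIdx.x == 0 && threadIdx.y == 0)
        s_hits = 0;
    __syncthreads();

    //this identifies the current thread uniquely within the grid
    int i0 = (threadIdx.x + blockIdx.x*blockDim.x);
    int j0 = (threadIdx.y + blockIdx.y*blockDim.y);
    int target = *myNumber;
    unsigned long long localHits = 0; //per-thread count, kept in a register

    //we increment i and j by an amount equal to the number of threads in one dimension of the block, 16 usually, times the number of blocks in one dimension, which can be quite large (but not 100,000)
    //j restarts from j0 for every i, and the bounds are inclusive to match the original loops
    for(int i = i0; i <= 100000; i += blockDim.x*gridDim.x){
        for(int j = j0; j <= 100000; j += blockDim.y*gridDim.y){
            //Thanks to talonmies for this simplification:
            //the k loop collapses to checking whether k = target-i-j is in range
            int k = target - i - j;
            if(0 <= k && k <= 100000)
                localHits++;
        }
    }

    //use atomics for this
    //otherwise, the value may change during the 'read, modify, write' process
    atomicAdd(&s_hits, localHits);

    //synchronize threads, so we know s_hits is completely updated
    __syncthreads();

    //again, atomics
    //we make sure only one thread per threadblock actually adds in s_hits
    if(threadIdx.x == 0 && threadIdx.y == 0)
        atomicAdd(numOfHits, s_hits);
}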

To launch the kernel, you will need something like this:

dim3 blocks(some_number, some_number, 1); //some_number should be hand-optimized
dim3 threads(16, 16, 1);
kernel<<<blocks, threads>>>(/*args*/);
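Before running anything on a GPU, the indexing idea can be sanity-checked on the CPU. The following is only a sketch in plain C++, not part of the original answer: it emulates each virtual thread walking the (i, j) space with a fixed stride and replaces the k loop with a range check, then compares against the original triple loop. The names countHitsBrute, countHitsStrided, strideX and strideY are hypothetical; strideX and strideY stand in for blockDim.x*gridDim.x and blockDim.y*gridDim.y, and limit is assumed to be the inclusive upper bound (100000 in the question).

```cpp
#include <cstdint>

// Brute-force reference: the question's triple loop with a configurable
// inclusive upper bound, so it can be run on small inputs.
std::int64_t countHitsBrute(int target, int limit) {
    std::int64_t hits = 0;
    for (int i = 0; i <= limit; ++i)
        for (int j = 0; j <= limit; ++j)
            for (int k = 0; k <= limit; ++k)
                if (i + j + k == target) ++hits;
    return hits;
}

// CPU emulation of the strided indexing. Each (tx, ty) pair plays the role
// of one GPU thread; strideX/strideY are assumed stand-ins for the total
// number of threads along x and y.
std::int64_t countHitsStrided(int target, int limit, int strideX, int strideY) {
    std::int64_t total = 0;
    for (int tx = 0; tx < strideX; ++tx) {
        for (int ty = 0; ty < strideY; ++ty) {
            std::int64_t localHits = 0;  // private count of this virtual thread
            for (int i = tx; i <= limit; i += strideX) {
                // j must restart from ty for every i, otherwise most
                // (i, j) pairs are never visited
                for (int j = ty; j <= limit; j += strideY) {
                    int k = target - i - j;  // the only k that can match
                    if (0 <= k && k <= limit) ++localHits;
                }
            }
            total += localHits;  // corresponds to the final reduction
        }
    }
    return total;
}
```

For small limits, countHitsStrided(target, limit, sx, sy) should agree with countHitsBrute(target, limit) for any positive strides, which is a cheap way to catch off-by-one bounds or a missing loop reset before porting the loops to a kernel.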

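I know you probably want a quick way to do this, but getting into CUDA isn't really a "quick" thing. That is, you will need to do some reading and some setup to get this working; past that, the learning curve isn't too steep. I haven't told you anything about memory allocation yet, so you will need to handle that yourself (although it is simple). If you followed my code, my goal is that you had to read up a bit on shared memory and CUDA, so you have already made a start. Good luck!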

Disclaimer: I haven't tested my code, and I am not an expert, so it could be idiotic.
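The answer mentions in passing that the count can be obtained without brute force. A minimal sketch of one way to do that in plain C++, assuming the same inclusive bounds as the question: for a fixed i, the values of j for which k = target - i - j also lies in range form one contiguous interval, so its length can be added directly. The names countHits, target and limit are illustrative and not from the original post. A 64-bit counter is used because for some inputs (150000, for example) the number of hits does not fit in a 32-bit int.

```cpp
#include <algorithm>
#include <cstdint>

// Counts triples (i, j, k) with 0 <= i, j, k <= limit and i + j + k == target
// in O(limit) steps instead of O(limit^3).
std::int64_t countHits(std::int64_t target, std::int64_t limit) {
    std::int64_t hits = 0;
    for (std::int64_t i = 0; i <= limit; ++i) {
        std::int64_t rest = target - i;  // j + k must equal this
        // j needs 0 <= j <= limit and 0 <= rest - j <= limit,
        // i.e. max(0, rest - limit) <= j <= min(limit, rest)
        std::int64_t lo = std::max<std::int64_t>(0, rest - limit);
        std::int64_t hi = std::min(limit, rest);
        if (hi >= lo) hits += hi - lo + 1;  // every j in [lo, hi] gives one hit
    }
    return hits;
}
```

Calling countHits(myNumber, 100000) should give the same count as the triple loop, and it is a convenient reference value when checking a GPU version.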
