c++ - How can a GPU help improve an iterative problem?

Question

I am using C++ to solve k coupled iterative equations. For example, for the case of 3 coupled equations,

 f(n+1) = g(n) + 2*h(n) + c;
 g(n+1) = 0.5*f(n+1) - h(n);
 h(n+1) = ( f(n+1)+g(n+1) )/2;

where c is a constant. In C/C++ the implementation is straightforward:

#include <cstdlib>
#include <iostream>
#include <vector>
using namespace std;

// acceptance test for the final values; its definition is not shown in the question
bool is_good(double f, double g, double h);

int main()
{
  double c= 0.24;
  long k=0;
  vector<double> f(900000000), g(900000000), h(900000000);

  while (k<10000000)
  {
    f[0] = g[0] = h[0] = rand(); // the initial values of f, g, h are randomly picked
    // start at n=0 so step 1 is computed from the initial values, and stop
    // at n=899999998 so that n+1 never runs past the last element
    for (long n=0; n<899999999; n++)
    {
      f[n+1] = g[n] + 2*h[n] + c;
      g[n+1] = 0.5*f[n+1] - h[n];
      h[n+1] = ( f[n+1]+g[n+1] )/2;
    }
    // if the final values of f, g, h satisfy some condition, record them and go on to the next iteration
    if (is_good(f[899999999], g[899999999], h[899999999]))
    {
      // record f[899999999], g[899999999], h[899999999]
      k++;
    }
  }
}

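This code is painfully slow, because each pass is long and whether it makes progress depends on the random initial values. I have not programmed a GPU before, but I have read some introductions that say a GPU is very fast in certain cases. From the few examples I have read, I get the impression that a GPU can only be used for "divisible" cases (I mean the task can be split into subtasks that can be run in parallel). I wonder how much it could help in my case. Any ideas or suggestions are highly welcome.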

score 2 · Accepted Answer

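Your program can easily be parallelized over the while (k<10000000) loop. In fact, since the termination condition is an unknown number of iterations (until 10M good sets have been found), you can basically drop the entire code you have shown into a kernel and run it as is, with a few minor modifications.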

#include <curand.h>
#include <curand_kernel.h>

__constant__ double c = 0.24;
__device__ volatile unsigned int k = 0;
#define SCALE 32767.0
#define NUM_GOOD 10000000

__device__ int is_good(double f, double g, double h){
  if (....){
    ...
    return 1;
  }
  return 0;
}

__global__ void initCurand(curandState *state, unsigned long seed){
  int idx = threadIdx.x + blockIdx.x*blockDim.x;
  curand_init(seed, idx, 0, &state[idx]);
}

__global__ void mykernel(curandState *devStates, double *good_f, double *good_g, double *good_h){
  int idx = threadIdx.x + blockDim.x*blockIdx.x;
  double f0, g0, h0, f1, g1, h1;
  curandState localState = devStates[idx];  // per-thread random number state
  while (k<NUM_GOOD){
    // assuming you wanted independent starting values for f, g, h
    f0 = (double)(curand_uniform(&localState)*SCALE);
    g0 = (double)(curand_uniform(&localState)*SCALE);
    h0 = (double)(curand_uniform(&localState)*SCALE);
    // two steps per pass: (f0,g0,h0) -> (f1,g1,h1) -> (f0,g0,h0)
    for (int i = 0; i< 450000000; i++){
      f1 = g0 + 2*h0 + c;
      g1 = 0.5*f1 - h0;
      h1 = (f1+g1 )/2;
      f0 = g1 + 2*h1 + c;
      g0 = 0.5*f0 - h1;
      h0 = (f0+g0 )/2;
    }
    // f1, g1, h1 now hold step 899999999, the index tested in the CPU version
    if (is_good(f1, g1, h1))
    {
      // reserve one slot; k is volatile, so the qualifier is cast away for atomicAdd
      unsigned int next =  atomicAdd((unsigned int *)&k, 1u);
      if (next<NUM_GOOD){
        good_f[next] = f1;
        good_g[next] = g1;
        good_h[next] = h1;
      }
    }
  }
}

The code above is just an outline; it may contain errors, and obviously not everything is defined here.

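You can experiment with the number of threads actually launched to see what runs fastest. All launched threads keep adding to the "good" stack until it is full. Each thread then detects that the stack is full and exits.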

Edit: answering some of the questions below:

It seems "int idx = threadIdx.x + blockDim.x*blockIdx.x;" is GPU stuff, and I think it relates to the threads in the GPU, so is it essential to GPU programming?

Yes, variables like threadIdx.x are "built-in" variables in CUDA that allow each thread to do something different (in this case, to start from different random values).
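
As a CPU-side illustration of what that index expression computes, here is a minimal sketch in plain C++ that loops over a hypothetical 1-D grid of blocks and threads and evaluates the same arithmetic. The helpers global_index and all_indices are made up for this example and are not part of CUDA; on a real GPU the hardware supplies threadIdx, blockIdx and blockDim, and no loop is needed.

```cpp
#include <cassert>
#include <vector>

// Same arithmetic as `threadIdx.x + blockDim.x*blockIdx.x` in the kernel,
// written as an ordinary function so it can run on a CPU.
// (global_index is a hypothetical helper, not a CUDA function.)
int global_index(int thread_idx, int block_idx, int block_dim) {
    assert(thread_idx >= 0 && thread_idx < block_dim);
    return thread_idx + block_dim * block_idx;
}

// Enumerate every (block, thread) pair of a 1-D launch and collect the
// index that each thread would compute for itself.
std::vector<int> all_indices(int grid_dim, int block_dim) {
    std::vector<int> out;
    for (int b = 0; b < grid_dim; ++b)
        for (int t = 0; t < block_dim; ++t)
            out.push_back(global_index(t, b, block_dim));
    return out;
}
```

For a launch of 3 blocks of 4 threads, all_indices(3, 4) yields each of 0 through 11 exactly once, which is why idx can be used to pick a distinct curandState per thread.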

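Second, all the code you provided looks like ordinary C++ code. But you put in a "GPU critical section", so do I need any special syntax in that section, or is it just like regular C++ code?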

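Yes, much CUDA kernel code can be ordinary C++ code, often similar to what you might write to do the same thing on a CPU. In this case I had mentioned a critical section and linked an example, but on reflection a critical section (here, used to restrict access to a data region so that GPU threads do not step on each other when updating the "good" values) is overkill. All that is needed is an atomic operation that reserves a "slot" in the stack for each thread that wants to store good values. I have modified the code accordingly.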
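
The slot-reservation idea can be tried on a CPU as well. The following is a rough sketch using std::thread and std::atomic, where fetch_add plays the role of atomicAdd. The function collect_good, its parameters, and the placeholder "candidate" value are assumptions made for this illustration; the real search and the is_good test are left out, and every candidate is treated as good.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Each worker reserves a slot with fetch_add and writes only to the slot
// it reserved, so no lock (critical section) is needed.
// collect_good is a hypothetical stand-in for the real search.
std::vector<int> collect_good(std::size_t capacity, unsigned num_workers) {
    assert(num_workers > 0);
    std::vector<int> good(capacity, -1);   // -1 marks an unfilled slot
    std::atomic<std::size_t> k{0};         // number of slots handed out so far

    auto worker = [&](int id) {
        while (k.load() < capacity) {
            int candidate = id;                 // placeholder for a found result
            std::size_t next = k.fetch_add(1);  // reserve one slot atomically
            if (next < capacity)                // reservations past the end are dropped
                good[next] = candidate;
        }
    };

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < num_workers; ++i)
        pool.emplace_back(worker, static_cast<int>(i));
    for (auto& t : pool) t.join();
    return good;
}
```

Because fetch_add returns every counter value exactly once, no two workers ever write to the same slot, and the bounds check discards the few reservations that arrive after the stack is already full.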

score 1 · Accepted Answer

Based on

 while (k<10000000)

you are trying to find 10M good {f, g, h} sets.

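In your single-threaded CPU code you find them one at a time, whereas on a GPU it is easy to launch thousands of threads that search in parallel until the total reaches 10M.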

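For the coupled-iteration part, you still need to compute the steps sequentially in the traditional way. But you can still improve the performance of this part by simplifying the equations to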

f(n+1) = 1   *g(n) + 2*h(n) +      c;
g(n+1) = 0.5 *g(n)          +  0.5*c;
h(n+1) = 0.75*g(n) + 1*h(n) + 0.75*c;
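
To check the algebra, here is a small C++ sketch that advances the original and the simplified recurrences side by side and reports how far apart they drift. The names State, step_original, step_simplified and max_relative_gap are hypothetical and chosen only for this example. It keeps just the current state, so no large arrays are involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct State { double f, g, h; };

// One step of the recurrence exactly as posed in the question.
State step_original(State s, double c) {
    State n;
    n.f = s.g + 2 * s.h + c;
    n.g = 0.5 * n.f - s.h;
    n.h = (n.f + n.g) / 2;
    return n;
}

// One step of the simplified form: every right-hand side uses only
// values from step n, so the three updates no longer depend on each other.
State step_simplified(State s, double c) {
    State n;
    n.f = s.g + 2 * s.h + c;
    n.g = 0.5 * s.g + 0.5 * c;
    n.h = 0.75 * s.g + s.h + 0.75 * c;
    return n;
}

// Largest relative gap between the two forms over `steps` iterations.
double max_relative_gap(State start, double c, int steps) {
    assert(steps >= 0);
    State a = start, b = start;
    double worst = 0.0;
    for (int i = 0; i < steps; ++i) {
        a = step_original(a, c);
        b = step_simplified(b, c);
        const double pairs[3][2] = {{a.f, b.f}, {a.g, b.g}, {a.h, b.h}};
        for (const auto& p : pairs) {
            double scale = std::max(1.0, std::fabs(p[0]));
            worst = std::max(worst, std::fabs(p[0] - p[1]) / scale);
        }
    }
    return worst;
}
```

In exact arithmetic the two forms agree; in floating point they can differ by rounding, which is why the comparison uses a relative tolerance rather than equality.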

The transformation matrix A for the vector [f,g,h,c]' is (in MATLAB code)

A = [ 0 1 2 1 ; 0 .5 0 .5; 0 .75 1 .75 ; 0 0 0 0];

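Then we have [f,g,h,c]'{n} = A^n * [f,g,h,c]'{0}. You will find that A^n converges within a few iterations to [0 3 2 3; 0 0 0 0; 0 1.5 1 1.5; 0 0 0 0]. Note that this limit holds for A exactly as written, whose all-zero last row resets c to 0 after the first step. To keep c constant, the last row would have to be [0 0 0 1]; in that case g(n) converges to c while h(n) keeps growing by about 1.5*c per step, so A^n no longer converges.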
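
The stated limit can be checked numerically. Below is a minimal C++ sketch that raises the matrix, taken exactly as written in this answer (including its zero last row), to a power by repeated multiplication and compares the result with the limit quoted above. Mat4, multiply, power, answer_matrix, stated_limit and max_abs_diff are names invented for this example.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>

using Mat4 = std::array<std::array<double, 4>, 4>;

// Plain 4x4 matrix product.
Mat4 multiply(const Mat4& x, const Mat4& y) {
    Mat4 r{};  // value-initialized to all zeros
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r[i][j] += x[i][k] * y[k][j];
    return r;
}

// A^n by repeated multiplication (n >= 1).
Mat4 power(const Mat4& a, int n) {
    assert(n >= 1);
    Mat4 r = a;
    for (int i = 1; i < n; ++i) r = multiply(r, a);
    return r;
}

// The matrix as written in the answer, acting on [f, g, h, c]'.
Mat4 answer_matrix() {
    Mat4 a = {{ {{0.0, 1.0,  2.0, 1.0}},
                {{0.0, 0.5,  0.0, 0.5}},
                {{0.0, 0.75, 1.0, 0.75}},
                {{0.0, 0.0,  0.0, 0.0}} }};
    return a;
}

// The limit quoted in the answer.
Mat4 stated_limit() {
    Mat4 m = {{ {{0.0, 3.0, 2.0, 3.0}},
                {{0.0, 0.0, 0.0, 0.0}},
                {{0.0, 1.5, 1.0, 1.5}},
                {{0.0, 0.0, 0.0, 0.0}} }};
    return m;
}

// Largest absolute entry-wise difference between two matrices.
double max_abs_diff(const Mat4& x, const Mat4& y) {
    double worst = 0.0;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            worst = std::max(worst, std::fabs(x[i][j] - y[i][j]));
    return worst;
}
```

Calling max_abs_diff(power(answer_matrix(), n), stated_limit()) for increasing n shows the gap shrinking by roughly half per step, since the only decaying mode of this matrix has factor 0.5.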
