c# - Parallel computing of array elements on GPU

Question

I am creating a database using C#. The problem is that I have close to 4 million data points, and it takes a long time to build the database (possibly several months). The code looks something like this.

int[,,,] Result1=new int[10,10,10,10];
int[,,,] Result2=new int[10,10,10,10];
int[,,,] Result3=new int[10,10,10,10];
int[,,,] Result4=new int[10,10,10,10];

for (int i=0;i<10;i++)
{
  for (int j=0;j<10;j++)
  {
    for (int k=0;k<10;k++)
    {
      for (int l=0;l<10;l++)
      {
        Result1[i,j,k,l]=myFunction1(i,j,k,l);
        Result2[i,j,k,l]=myFunction2(i,j,k,l);
        Result3[i,j,k,l]=myFunction3(i,j,k,l);
        Result4[i,j,k,l]=myFunction4(i,j,k,l);
      }
    }
  }
}

All the elements of the Result arrays are completely independent of each other. My PC has 8 cores and I have created a thread for each of the myFunction methods, but the whole process would still take a long time simply because there are so many cases. I am wondering whether there is any way to run this on the GPU rather than the CPU. I have not done that before and I do not know how it would work. I would appreciate any help with this.

score 1 · Accepted Answer

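You could consider rewriting this part of the application in C++ AMP and calling it from your .NET code. For details, see http://blogs.msdn.com/b/nativeconcurrency/archive/2012/08/30/learn-c-amp.aspx

However, the code you show has 40,000 data points, not 4,000,000.

There are about 2.6 million seconds in a month. With 40,000 data points, that gives you more than a minute per data point. (Even if you really do have 4 million data points, that is still more than half a second per data point.) I don't know what these functions do, but I would be surprised if something that takes that long to run were an ideal candidate for running on a GPU.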
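The time budget above can be checked with a few lines of C#. This is only a sketch of the arithmetic: the class and method names are made up for illustration, and the 2.6 million seconds per month is the rough figure quoted in this answer.

```csharp
using System;

public static class TimeBudget
{
    // Seconds available per data point if the whole run takes totalSeconds.
    public static double SecondsPerPoint(double totalSeconds, long points)
    {
        return totalSeconds / points;
    }

    public static void Main()
    {
        const double secondsPerMonth = 2.6e6; // rough figure, not an exact month length

        // 4 arrays x 10^4 elements = 40,000 points: 65 seconds each.
        Console.WriteLine(SecondsPerPoint(secondsPerMonth, 40000));

        // 4,000,000 points: 0.65 seconds each.
        Console.WriteLine(SecondsPerPoint(secondsPerMonth, 4000000));
    }
}
```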

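Perhaps you need to revisit the algorithms used in these functions to see whether they can be optimized. You may even have to reconsider your idea of computing each data point independently of the others. Are you sure you can't compute a result more efficiently if you already know some of the other results?

Update:

What I meant by that last sentence is that there may be repeated computation going on. For example, if part of the calculation done by myFunction1 depends only on the first two parameters, you could restructure the code as follows: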

for (int i = 0; i < 10; i++)
{
  for (int j = 0; j < 10; j++)
  {
    var commonPartValue = commonPart(i, j);

    for (int k = 0; k < 10; k++)
    {
      for (int l = 0; l < 10; l++)
      {
        Result1[i, j, k, l] = myFunction1b(i, j, k, l, commonPartValue);
      }
    }
  }
}

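The net result is that you compute this "common part" once where you used to compute it a hundred times.

Another case is where you can compute a result more efficiently from a previous result instead of having to start from scratch. For example, n² is easily computed as n * n, but if you already know (n - 1)², then n² = (n - 1)² + 2 * n - 1. In integer arithmetic, this means you replace a multiplication with a shift, an addition and a decrement, which is faster.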
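The recurrence n² = (n - 1)² + 2 * n - 1 can be sketched as follows. This is a minimal illustration only: IncrementalSquares and Compute are made-up names, and whether this actually beats a plain multiplication depends on the hardware and the JIT.

```csharp
using System;

public static class IncrementalSquares
{
    // Returns an array where squares[n] == n * n for n in [0, count),
    // built without multiplying: each entry is derived from the previous one.
    public static int[] Compute(int count)
    {
        var squares = new int[count]; // squares[0] is already 0
        for (int n = 1; n < count; n++)
        {
            // (n << 1) is 2 * n done as a shift.
            squares[n] = squares[n - 1] + (n << 1) - 1;
        }
        return squares;
    }

    public static void Main()
    {
        // Prints: 0, 1, 4, 9, 16, 25
        Console.WriteLine(string.Join(", ", Compute(6)));
    }
}
```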

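Now, I'm not saying your problem is as simple as these examples, but I am saying that you should look for this kind of optimization first, before looking for a better compiler or different hardware.

Also, as a side note: I assume you are storing what you compute on disk rather than in an array in RAM. I wouldn't want to wait a month for the results only to have the power go out...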
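One way to make a long run survive an interruption is to append each result to a file as soon as it is computed and to skip already-stored points on restart. The sketch below assumes a simple "i,j,k,l,value" text format. The Checkpoint class, its method names and the compute delegate (a stand-in for one of the myFunction methods) are all hypothetical. It does not handle a line that was only half written when the process died, and Flush hands the data to the operating system rather than guaranteeing it has reached the physical disk.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class Checkpoint
{
    // Reads the "i,j,k,l" keys of points written by an earlier run.
    public static HashSet<string> LoadDone(string path)
    {
        var done = new HashSet<string>();
        if (!File.Exists(path)) return done;
        foreach (var line in File.ReadLines(path))
        {
            int cut = line.LastIndexOf(','); // everything before the value
            if (cut > 0) done.Add(line.Substring(0, cut));
        }
        return done;
    }

    // Computes every point that is not yet in the file and appends it
    // immediately. Returns how many points were computed in this run.
    public static int Run(string path, Func<int, int, int, int, int> compute, int size)
    {
        var done = LoadDone(path);
        int computed = 0;
        using (var writer = new StreamWriter(path, append: true))
        {
            for (int i = 0; i < size; i++)
            for (int j = 0; j < size; j++)
            for (int k = 0; k < size; k++)
            for (int l = 0; l < size; l++)
            {
                string key = $"{i},{j},{k},{l}";
                if (done.Contains(key)) continue; // stored by a previous run

                writer.WriteLine($"{key},{compute(i, j, k, l)}");
                writer.Flush(); // do not keep finished work only in memory
                computed++;
            }
        }
        return computed;
    }
}
```

Calling Checkpoint.Run a second time with the same path should find every point already stored and compute nothing.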

score 1 · Accepted Answer

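Yes, the intuition in scenarios like this is to use multithreading, or even a GPU, to speed things up. But the important thing is to work out whether the scenario is actually suitable for parallel computing.

As you say, the data points are independent of each other, yet running the multithreaded version on 8 cores gave no noticeable improvement. That points to one of two underlying problems: either your statement about the independence of the data is wrong, or your multithreaded implementation is not optimized. I suggest you first tune your code until you see an improvement, and only then look for ways to port it to a GPU platform.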
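A quick way to check whether multithreading helps at all is to time a sequential run against a parallel one on the same workload. The sketch below is only an assumption about how such a check might look: Work is a made-up CPU-bound stand-in for the real functions, and the timings it prints will vary from machine to machine.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class SpeedupCheck
{
    // Hypothetical CPU-bound stand-in for one of the real functions.
    public static int Work(int seed)
    {
        int acc = seed;
        for (int n = 0; n < 200000; n++)
        {
            acc = unchecked(acc * 31 + n); // overflow wraps around on purpose
        }
        return acc;
    }

    public static int[] RunSequential(int count)
    {
        var result = new int[count];
        for (int i = 0; i < count; i++) result[i] = Work(i);
        return result;
    }

    public static int[] RunParallel(int count)
    {
        var result = new int[count];
        // Each iteration writes a different element, so no locking is needed.
        Parallel.For(0, count, i => result[i] = Work(i));
        return result;
    }

    public static void Main()
    {
        var watch = Stopwatch.StartNew();
        RunSequential(1000);
        long sequentialMs = watch.ElapsedMilliseconds;

        watch.Restart();
        RunParallel(1000);
        long parallelMs = watch.ElapsedMilliseconds;

        Console.WriteLine($"sequential {sequentialMs} ms, parallel {parallelMs} ms");
    }
}
```

If the parallel time is not clearly lower on a multi-core machine, the work is probably limited by something other than raw CPU, such as shared state, locking or I/O.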

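Alternatively, you could look at OpenCL, which targets both parallel threads and GPU cores. But again, the important thing is to work out whether your problem is really suited to parallel computing.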

score 0 · Accepted Answer

I don't think your code sample uses all eight cores, only one. The following should use all 8:

// Requires: using System.Threading.Tasks;
private void Para()
{
    int[,,,] Result1 = new int[10, 10, 10, 10];
    int[,,,] Result2 = new int[10, 10, 10, 10];
    int[,,,] Result3 = new int[10, 10, 10, 10];
    int[,,,] Result4 = new int[10, 10, 10, 10];

    // The bounds are int (0, not 0L) so that i, j, k and l are int
    // and can be passed straight to the myFunction methods.
    Parallel.For(0, 10, i =>
    {
        Parallel.For(0, 10, j =>
        {
            Parallel.For(0, 10, k =>
            {
                Parallel.For(0, 10, l =>
                {
                    // Every iteration writes to a different element,
                    // so no locking is needed.
                    Result1[i, j, k, l] = myFunction1(i, j, k, l);
                    Result2[i, j, k, l] = myFunction2(i, j, k, l);
                    Result3[i, j, k, l] = myFunction3(i, j, k, l);
                    Result4[i, j, k, l] = myFunction4(i, j, k, l);
                });
            });
        });
    });
}
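The four nested Parallel.For calls could also be collapsed into one loop over a flat index, which leaves the scheduler a single range of work items to partition. This is a sketch of that variation rather than part of the original answer: FlatParallel, Fill and the compute delegate (a stand-in for one of the myFunction methods) are hypothetical names.

```csharp
using System;
using System.Threading.Tasks;

public static class FlatParallel
{
    // Fills a size x size x size x size array, computing every element in parallel.
    public static int[,,,] Fill(int size, Func<int, int, int, int, int> compute)
    {
        var result = new int[size, size, size, size];
        int total = size * size * size * size;

        Parallel.For(0, total, index =>
        {
            // Decode the flat index back into the four coordinates.
            int l = index % size;
            int k = (index / size) % size;
            int j = (index / (size * size)) % size;
            int i = index / (size * size * size);

            // Each index maps to a distinct element, so no locking is needed.
            result[i, j, k, l] = compute(i, j, k, l);
        });

        return result;
    }
}
```

For example, FlatParallel.Fill(10, myFunction1) would produce the same contents as Result1 above, assuming myFunction1 takes four int arguments and returns an int.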

If that is still not enough, take a look at Cudafy, which should be easier than rewriting all of your complex functions in C++.
