cuda - CUDA. Data organization

Question

In my application, each thread needs its own data matrix. Suppose I have T threads, and each thread works with a different matrix D[M][N].

My question: how should I organize the data structure?

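My solution: I define an array A of T*M*N elements. To avoid bank conflicts, I first store the element D[0][0] T times, once per thread, then D[0][1] ... D[0][M-1], D[1][0] and so on (if you view this array as a matrix T * (M*N), each thread gets one column). This way, the same element for different threads lies in different memory banks. Accordingly, I access the element D[i][j] of thread x as follows: D[i][j](x) == A[T * (M * i + j) + x]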
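The indexing formula above can be written as a small helper to make the layout concrete. This is only a host-side sketch: the function name `interleavedIndex` is made up for illustration, and it uses M as the row stride exactly as the question's formula does.

```cpp
#include <cassert>
#include <cstddef>

// Position of element D[i][j] of thread x inside the flattened array A,
// following the question's formula A[T * (M * i + j) + x].
// T is the number of threads; M is the row stride used in the question.
inline std::size_t interleavedIndex(std::size_t T, std::size_t M,
                                    std::size_t i, std::size_t j,
                                    std::size_t x) {
    assert(x < T);  // thread id must be in range
    return T * (M * i + j) + x;
}
```

For a fixed (i, j), consecutive thread ids map to consecutive positions in A, and moving from D[i][j] to D[i][j+1] for the same thread is a jump of T elements.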

My problem: computing such a complicated index is computationally expensive.

P.S. I have an Nvidia Tesla C2075 (CUDA compute capability 2.0).

score 0 · Accepted Answer

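You say M and N can be a few hundred. In that case you won't be able to make much use of shared memory (if any at all). You may also want to keep a close eye on global memory consumption (although the Tesla has a lot of memory)! 200x200 x 3584 threads (the bare minimum for a C2075, in my opinion) x sizeof(int) comes to about 547 MiB (573,440,000 bytes) of data.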
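The footprint estimate above is plain multiplication, and it can be checked for other sizes with a small host-side helper. This is a sketch only: the names `footprintBytes` and `toMiB` are illustrative, and the element size of 4 bytes in the usage note assumes a 4-byte int.

```cpp
#include <cassert>
#include <cstddef>

// Total bytes needed when every thread owns one M x N matrix.
inline unsigned long long footprintBytes(unsigned long long M,
                                         unsigned long long N,
                                         unsigned long long threads,
                                         unsigned long long elemSize) {
    assert(M > 0 && N > 0 && threads > 0 && elemSize > 0);
    return M * N * threads * elemSize;
}

// Converts a byte count to mebibytes (1 MiB = 1024 * 1024 bytes).
inline double toMiB(unsigned long long bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}
```

Calling `footprintBytes(200, 200, 3584, 4)` gives 573,440,000 bytes, which `toMiB` converts to 546.875, the figure quoted in the answer after rounding.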

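Global memory access patterns work differently. Global memory is divided into segments of 32, 64 and 128 bytes. The cost of a read is roughly the number of distinct segments accessed per warp. In short, it usually boils down to this: the more scattered your accesses, the worse the performance.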

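So, unless every thread accesses its own matrix at the same index (at least most of the time), no memory organization will be efficient. However, if that condition holds, then the layout you describe may work.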
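To make the cost model above tangible, the following host-side sketch counts how many distinct 128-byte segments one warp of 32 threads touches. It is a simplified model, not a description of the actual hardware: the function names are invented for illustration, elements are assumed to be 4 bytes, and the array is assumed to start on a segment boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Number of distinct aligned segments covered by the given element positions.
inline std::size_t distinctSegments(const std::vector<std::size_t>& positions,
                                    std::size_t elemBytes,
                                    std::size_t segmentBytes) {
    std::set<std::size_t> segments;
    for (std::size_t p : positions) segments.insert(p * elemBytes / segmentBytes);
    return segments.size();
}

// Interleaved layout from the question: lane l (thread firstThread + l)
// reads flat element number elemOfLane[l] of its own matrix.
inline std::vector<std::size_t> interleavedPositions(
        std::size_t T, std::size_t firstThread,
        const std::vector<std::size_t>& elemOfLane) {
    assert(elemOfLane.size() == 32);  // one entry per lane of the warp
    std::vector<std::size_t> pos;
    for (std::size_t l = 0; l < 32; ++l)
        pos.push_back(T * elemOfLane[l] + firstThread + l);
    return pos;
}

// Alternative layout for comparison: each thread's matrix stored contiguously.
inline std::vector<std::size_t> contiguousPositions(
        std::size_t elemsPerMatrix, std::size_t firstThread, std::size_t elem) {
    std::vector<std::size_t> pos;
    for (std::size_t l = 0; l < 32; ++l)
        pos.push_back((firstThread + l) * elemsPerMatrix + elem);
    return pos;
}
```

Under these assumptions, a warp in which every lane reads the same element index in the interleaved layout touches a single 128-byte segment, while lanes reading different indices, or any access in the per-thread contiguous layout with large matrices, spread over up to 32 segments.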

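Also, if you have a scattered access pattern, disabling L1 caching may help. This is because an L1 cache line is 128 bytes, whereas an L2 cache line is only 32 bytes, so you can reduce over-fetching. At the very least, give it a try :)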

To ease the pain of accessing the array, I would do something like this:

// Let the kernel dimensions be known at compile time; this can save some computation and registers.
// Assuming one-dimensional kernels.

static const int blockSize = ...; // equivalent to blockDim
static const int gridSize = ...; // equivalent to gridDim
static const int rowSize = blockSize * gridSize; // total number of threads

template <typename T, int M, int N>
class MyMatrix {
private:
  T* data; // flattened array in global memory
  int tid; // global thread index; selects this thread's matrix
public:
  __device__ inline MyMatrix(T* dataIsHere) : data(dataIsHere) {
    tid = threadIdx.x + blockDim.x * blockIdx.x;
  }
  // Element (x, y) of this thread's matrix. The same element of
  // consecutive threads sits at consecutive addresses.
  __device__ inline T& operator()(int x, int y) {
    return data[(y * M + x) * rowSize + tid];
  }
};

//assuming the matrix size is 200x200 and consists of ints:

__global__ void myKernel(int* flattenedMatrices) {
  MyMatrix<int,200,200> matrix(flattenedMatrices);

  ...

  matrix(2,4) = .... // happily access the matrix for both loads and stores
}
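The index arithmetic of the accessor class can be tried out on the host before running it in a kernel. The class below is a hypothetical stand-in written for this purpose, not part of the answer: `HostMatrixView` is an invented name, and `rowSize` and `tid` are passed in explicitly because `threadIdx` and `blockIdx` do not exist on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side mirror of the device accessor: same formula
// (y * M + x) * rowSize + tid, with tid supplied by the caller.
template <typename T, int M, int N>
class HostMatrixView {
    T* data;              // flattened array shared by all "threads"
    std::size_t rowSize;  // total number of threads
    std::size_t tid;      // which thread's matrix this view addresses
public:
    HostMatrixView(T* d, std::size_t rowSize_, std::size_t tid_)
        : data(d), rowSize(rowSize_), tid(tid_) {
        assert(tid < rowSize);
    }
    T& operator()(int x, int y) {
        assert(x >= 0 && x < M && y >= 0 && y < N);  // bounds check, host only
        return data[(static_cast<std::size_t>(y) * M + x) * rowSize + tid];
    }
};
```

The flattened buffer needs M * N * rowSize elements. Two views with neighbouring `tid` values refer to neighbouring array slots for the same (x, y), which is the property the layout relies on.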
