I am new to CUDA development and I decided to start writing small example scripts to see how it works. I decided to share a kernel I made that computes the squared Euclidean distance between the corresponding rows of two equal-sized matrices.
__global__ void cudaEuclid( float* A, float* B, float* C, int rows, int cols )
{
    int i;
    float squareEuclDist = 0.0f;                   // accumulate in float, since A and B are float
    int r = blockDim.x * blockIdx.x + threadIdx.x; // row index, one thread per row
    //int c = blockDim.y * blockIdx.y + threadIdx.y; // cols
    if( r < rows ){ // take each row with var r (thread)
        for ( i = 0; i < cols; i++ ) // squared Euclidean dist of row r; element (r,i) is at r + rows*i (column-major)
            squareEuclDist += ( A[r + rows*i] - B[r + rows*i] ) * ( A[r + rows*i] - B[r + rows*i] );
        C[r] = squareEuclDist;
    }
}
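For reference, a plain CPU version of the same computation (just a minimal sketch for checking the results; cpuEuclid is a name I made up, and it assumes the same column-major layout A[r + rows*i] as the kernel) would be:

// CPU reference for checking the kernel output (hypothetical helper).
// Element (r, i) of a rows x cols matrix is stored at index r + rows*i,
// matching the indexing used in cudaEuclid above.
void cpuEuclid( const float* A, const float* B, float* C, int rows, int cols )
{
    for ( int r = 0; r < rows; r++ ) {
        float squareEuclDist = 0.0f;
        for ( int i = 0; i < cols; i++ ) {
            float d = A[r + rows*i] - B[r + rows*i];
            squareEuclDist += d * d;
        }
        C[r] = squareEuclDist;
    }
}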
The kernel launch configuration is initialized with
int threadsPerBlock = 256;
int blocksPerGrid = ceil( (double) numElements  / threadsPerBlock); 
// numElements  = 1500x200 (matrix size) ==> 1172 blocks/grid
and the kernel is called as
cudaEuclid<<<blocksPerGrid, threadsPerBlock>>>( d_A, d_B, d_C, rows, cols );
d_A and d_B are the input matrices, which in this example are of size 1500 x 200.
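For context, the host-side setup looks roughly like this (a sketch with error checking and data initialization omitted; h_A, h_B, h_C are just my names for the host buffers):

int rows = 1500, cols = 200;
int numElements = rows * cols;
size_t matBytes = numElements * sizeof(float);

// host buffers (filling h_A and h_B with data is omitted here)
float *h_A = (float*) malloc( matBytes );
float *h_B = (float*) malloc( matBytes );
float *h_C = (float*) malloc( rows * sizeof(float) );

// device buffers: full matrices for A and B, one distance per row for C
float *d_A, *d_B, *d_C;
cudaMalloc( (void**)&d_A, matBytes );
cudaMalloc( (void**)&d_B, matBytes );
cudaMalloc( (void**)&d_C, rows * sizeof(float) );

cudaMemcpy( d_A, h_A, matBytes, cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, matBytes, cudaMemcpyHostToDevice );

int threadsPerBlock = 256;
int blocksPerGrid = ceil( (double) numElements / threadsPerBlock ); // 1172 blocks
// note: the kernel only does work in the first 'rows' threads (it guards with r < rows),
// so the grid could also be sized over rows instead of numElements:
// blocksPerGrid = ( rows + threadsPerBlock - 1 ) / threadsPerBlock;  // 6 blocks for 1500 rows

cudaEuclid<<<blocksPerGrid, threadsPerBlock>>>( d_A, d_B, d_C, rows, cols );

cudaMemcpy( h_C, d_C, rows * sizeof(float), cudaMemcpyDeviceToHost );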
Question 1: I have read the basic theory of choosing the number of threads per block and blocks per grid, but something is still missing. I am trying to understand what the optimal kernel launch configuration would be for this simple kernel, and I am asking for some help to start thinking the CUDA way.
Question 2: The other thing I would like to ask is whether there are any suggestions for improving the efficiency of the code. Can we use int c = blockDim.y * blockIdx.y + threadIdx.y to make things more parallel? Is shared memory applicable here?
My GPU information is attached below.
Device 0: "GeForce 9600 GT"
  CUDA Driver Version / Runtime Version          5.5 / 5.0
  CUDA Capability Major/Minor version number:    1.1
  Total amount of global memory:                 512 MBytes (536870912 bytes)
  ( 8) Multiprocessors x (  8) CUDA Cores/MP:    64 CUDA Cores
  GPU Clock rate:                                1680 MHz (1.68 GHz)
  Memory Clock rate:                             700 Mhz
  Memory Bus Width:                              256-bit
  Max Texture Dimension Size (x,y,z)             1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers        1D=(8192) x 512, 2D=(8192,8192) x 512
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  768
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Concurrent kernel execution:                   No
  Device supports Unified Addressing (UVA):      No
  Device PCI Bus ID / PCI location ID:           1 / 0
Question 3: Can we relate the amount of global memory to the shared memory and the other kinds of memory the GPU has? Does the number of threads have anything to do with that?
Question 4: If the maximum number of threads per block is 512, how can the maximum size of each dimension of a block be 512 x 512 x 64 (= 16,777,216 threads)? And how does it relate to the maximum size of each dimension of my grid?
Question 5: Using the memory clock rate, can we say how many threads are processed per second?
UPDATE:
The for loop has been replaced with one thread per column:
__global__ void cudaEuclid( float* A, float* B, float* C, int rows, int cols ){
    int r = blockDim.x * blockIdx.x + threadIdx.x; // row index
    int c = blockDim.y * blockIdx.y + threadIdx.y; // column index
    float x = 0;
    if( c < cols && r < rows ){
        // note: this indexing assumes row-major storage (element (r,c) at c + r*cols),
        // unlike the column-major indexing of the original kernel
        x = ( A[c + r*cols] - B[c + r*cols] ) * ( A[c + r*cols] - B[c + r*cols] );
        // every column thread of row r writes C[r] here, so the squared differences
        // overwrite each other instead of being summed; a per-row reduction is still
        // needed (see the sketch below the launch code)
        C[r] = x;
    }
}
Called with (since the kernel now uses a 2D thread index, the launch has to be 2D as well; with the previous 1D launch, threadIdx.y and blockIdx.y are always 0, so c would always be 0):
dim3 threadsPerBlock( 16, 16 );                 // 256 threads per block, arranged in 2D
dim3 blocksPerGrid( ( rows + threadsPerBlock.x - 1 ) / threadsPerBlock.x,
                    ( cols + threadsPerBlock.y - 1 ) / threadsPerBlock.y );
cudaEuclid<<<blocksPerGrid, threadsPerBlock>>>( d_A, d_B, d_C, rows, cols );
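Since this 2D version still only stores one squared difference per row instead of the sum, I have also been thinking about a per-row reduction in shared memory, which would also touch on Question 2. Below is just a sketch of how I imagine it (not tested; cudaEuclidReduce is my own name for it, it assumes a power-of-two block size such as 256 and the column-major layout A[r + rows*i] of the original kernel, and it uses one block per row):

__global__ void cudaEuclidReduce( const float* A, const float* B, float* C,
                                  int rows, int cols )
{
    extern __shared__ float partial[];          // blockDim.x floats, size passed at launch

    int r = blockIdx.x;                         // one block per row
    float x = 0.0f;

    // each thread accumulates a strided subset of the columns of row r
    for ( int i = threadIdx.x; i < cols; i += blockDim.x ) {
        float d = A[r + rows*i] - B[r + rows*i];
        x += d * d;
    }
    partial[threadIdx.x] = x;
    __syncthreads();

    // tree reduction in shared memory (blockDim.x must be a power of two)
    for ( int s = blockDim.x / 2; s > 0; s >>= 1 ) {
        if ( threadIdx.x < s )
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    if ( threadIdx.x == 0 )
        C[r] = partial[0];                      // squared Euclidean distance of row r
}

I would launch it as cudaEuclidReduce<<<rows, 256, 256 * sizeof(float)>>>( d_A, d_B, d_C, rows, cols ); that is, 1500 blocks of 256 threads with 1 KB of shared memory per block, which should fit within the 16 KB per-block limit of my device. Is this the kind of thing shared memory is meant for here?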