cuda - 子矩阵计算

Question

我想计算矩阵的两个子矩阵之间的成对距离。例如，我有一个矩阵 A (MxN) 和该矩阵 B1 (mxn) 和 B2 (kxt) 的两个块。更具体地说，我想计算 B1(1,1) 元素与 B2 的所有其他元素的距离，并对所有 B1 元素执行此过程。更清楚地说，B1 和 B2 可能不是矩阵的紧凑部分，基本上我知道的信息是矩阵 A 上 B1 和 B2 的元素的坐标。这是一个例子。

for(int i = 0; i < nRowsCoordsB1 ; i++ ){//nRows of B1
  for(int j = 0; j < nRowsCoordsB2 ; j++ ){//nRows of B2

    //CoordsofB1 is a nRowsB1x2 matrix that contains the element coordinates of the B1 sub matrix

    a_x = CoordsofB1[ i ]; //take the x coord of the corresponding row i
    a_y = CoordsofB1[ i + nRowsCoordsB1 ]; //take the y coord of the corresponding row

    b_x = CoordsofB2[ j ];
    b_y = CoordsofB2[ j + nRowsCoordsB2 ];

    int element1 = A[ a_x + a_y*nRowsofA ];
    int element2 = A[ b_x + b_y*nRowsofA ] ;
    sum +=abs( element1 - element2 ) ;

  }
}
*Output = sum/(float)(numberOfElementsofB1*numberOfElementsofB2);

现在我想用 CUDA 加速计算 :) 因为我是 Cuda 的新手，所以我发现它有点复杂。从现在开始，我认为我已经理解了在矩阵级别分配块线程的逻辑，但事实上我有两个不同大小的矩阵部分 CoordsofB1 和 CoordsofB2，这让我有点困惑如何访问它们采取坐标并在孔矩阵中使用它们。我认为我们应该使用约束在 A 中工作，但我没有明确的想法。

此外，在 for 循环结束时，总和除以一个数量这一事实使我对我们将在 cuda 翻译代码中组合的人感到困惑。

任何建议-片段-示例-参考都会很棒。

PS：我使用列优先排序的原因是因为代码是在 matlab 中评估的。

更新：我们可以分配大小等于最大子矩阵 B1 或 B2 大小的线程块并使用正确的条件与它们一起工作吗？我评论了最后一行，因为我不确定如何处理它。任何意见？

int r = blockDim.x * blockIdx.x + threadIdx.x; // rows
if( r < nRowsCoordsB1 ){       

  a_x = CoordsofB1[ r ]; 
  a_y = CoordsofB1[ r + nRowsCoordsB1 ]; 
  if( r < nRowsCoordsB2 ;){

    b_x = CoordsofB2[ r ];
    b_y = CoordsofB2[ r + nRowsCoordsB2 ];
    int element1 = A[ a_x + a_y*nRowsofA ];
    int element2 = A[ b_x + b_y*nRowsofA ] ;
    sum +=abs( element1 - element2 ) ;

  }
}
//*Output = sum/(float)(numberOfElementsofB1*numberOfElementsofB2);

这里有一个草图在此处输入图像描述

我有 B1 和 B2 内每个元素的坐标，我想计算其中的值之间的差异

[ (B1(1,1) - B2(1,1)) + (B1(1,1) - B2(1,2)) + ... + (B1(1,1) - B2(:,: )) ] +

[ (B1(1,2) - B2(1,1)) + (B1(1,2) - B2(1,2)) + ... + (B1(1,2) - B2(:,: )) ] +

[ (B1(:,:) - B2(1,1)) + (B1(:,:) - B2(1,2)) + ... + (B1(:,:) - B2(:,: )) ]。

score 1 · Accepted Answer

也许下面使用 2D 线程网格的解决方案可以替代 Eric 使用推力来更深入地了解问题。

下面的代码片段仅用于说明这个概念。这是一个未经测试的代码。

二维网格

定义一个partial_distances大小矩阵，该矩阵将包含和nRowsCoordsB1 X nRowsCoordsB2的元素之间所有涉及的绝对值差。在文件中，您将拥有B1B2main

__global__ void distance_calculator(int* partial_distances, int* CoordsofB1, int* CoordsofB2, int nRowsCoordsB1, int nRowsCoordsB2) {

    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;

    int a_x = CoordsofB1[i]; 
    int a_y = CoordsofB1[i+nRowsCoordsB1];

    int b_x = CoordsofB2[j];
    int b_y = CoordsofB2[j+nRowsCoordsB2];

    partial_distances[j*nRowsCoordsB1+i] = abs(A[a_x+a_y*nRowsofA]-A[b_x+b_y*nRowsofA]);

}

int iDivUp(int a, int b) { return (a % b != 0) ? (a / b + 1) : (a / b); }

#define BLOCKSIZE 32

int main() {

    int* partial_distances; cudaMalloc((void**)&partial_distances,nRowsCoordsB1*nRowsCoordsB2*sizeof(int));

    dim3 BlocSize(BLOCKSIZE,BLOCKSIZE);
    dim3 GridSize;
    GridSize.x = iDivUp(nRowsCoordsB1,BLOCKSIZE);
    GridSize.y = iDivUp(nRowsCoordsB2,BLOCKSIZE);

    distance_calculator<<<GridSize,BlockSize>>>(partial_distances,CoordsofB1,CoordsofB2,nRowsCoordsB1,nRowsCoordsB2);

   REDUCTION_STEP

}

可以实现为对一维归约内核的REDUCTION_STEP迭代调用，以总结与的特定元素相对应的所有元素B1。

另一种方法是使用动态并行性直接在内核中调用缩减例程，但这是一个不适合您使用的卡的选项。

score 1 · Accepted Answer

如果我理解正确，您尝试做的事情可以写在下面的 matlab 代码中。

rep_B1 = repmat(B1(:),  1, length(B2(:)) );
rep_B2 = repmat(B2(:)', length(B1(:), 1) );
absdiff_B1B2 = abs(rep_B1 - repB2);
Result = mean( absdiff_B1B2(:) );

您会注意到，在缩减之前，有一个absdiff_B1B2大小为length(B1(:))x的矩阵length(B2(:))，即m*nx k*t（如果您在一个 CUDA 内核中实现上述代码，则此矩阵永远不会存储到全局内存中）。您可以将此矩阵划分为 16x16 子矩阵，并为每个子矩阵使用一个 256 线程块来将工作负载分解到 GPU。

另一方面，你可以使用推力让你的生活更轻松。

更新

由于B1和B2是的子矩阵A，您可以先使用cudaMemcpy2D()将它们复制到线性空间，然后使用内核构造然后归约矩阵absdiff_B1B2。

对于最终的规范化操作（代码的最后一行），您可以在 CPU 上进行。

这是使用推力的代码来展示如何absdiff_B1B2在单个内核中构造和减少矩阵。但是你会发现构造过程没有使用共享内存并且没有优化。使用共享内存进一步优化将提高性能。

#include <thrust/device_vector.h>
#include <thrust/inner_product.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>

template<typename T>
struct abs_diff
{
    inline __host__ __device__ T operator()(const T& x, const T& y)
    {
        return abs(x - y);
    }
};

int main()
{
    using namespace thrust::placeholders;

    const int m = 98;
    const int n = 87;
    int k = 76;
    int t = 65;
    double result;

    thrust::device_vector<double> B1(m * n, 1.0);
    thrust::device_vector<double> B2(k * t, 2.0);

    result = thrust::inner_product(
            thrust::make_permutation_iterator(
                    B1.begin(),
                    thrust::make_transform_iterator(
                            thrust::make_counting_iterator(0),
                            _1 % (m * n))),
            thrust::make_permutation_iterator(
                    B1.begin(),
                    thrust::make_transform_iterator(
                            thrust::make_counting_iterator(0),
                            _1 % (m * n))) + (m * n * k * t),
            thrust::make_permutation_iterator(
                    B2.begin(),
                    thrust::make_transform_iterator(
                            thrust::make_counting_iterator(0),
                            _1 / (m * n))),
            0.0,
            thrust::plus<double>(),
            abs_diff<double>());
    result /= m * n * k * t;

    std::cout << result << std::endl;

    return 0;
}

cuda - 子矩阵计算

2 回答 2

更新

Related

Reference