matrix - Accessing submatrices with cuBLAS

Question

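I have read the following post

I would like to do something similar, calling cuBLAS routines from Fortran.

Essentially, I have a large matrix partitioned into 3 x 3 blocks, and the partitioning changes at each step of a loop. At the moment, I allocate/free pointers for each individual sub-block and copy the relevant part of the matrix to and from the device at every step. That creates a lot of overhead that I would like to eliminate. Is this possible?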

score 4 · Accepted Answer

You can do device pointer arithmetic in host code in the same way you would with host pointers. For example, if you had an MxN matrix stored on the GPU:

 float *A_d;
 cudaMalloc((void **)&A_d, size_t(M*N)*sizeof(float));

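and you wanted to operate on a submatrix starting at (x1,y1), then you would pass A_d+x1+M*y1 to any CUBLAS function that expects a matrix as an argument, keeping M as the leading dimension. This relies on the column-major storage that CUBLAS assumes, with x1 the zero-based row offset and y1 the zero-based column offset.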
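The offset rule can be sketched on the host as follows. This is a minimal illustration, not part of cuBLAS: `SubmatrixView`, `submatrix`, `at` and `partition_block` are hypothetical names introduced here, and the split-point vectors are an assumed way to describe a 3 x 3 partition that changes from step to step. With a device pointer the address arithmetic is identical, but the pointer must not be dereferenced on the host, so `at` applies to host memory only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a cuBLAS type): a non-owning view of a block
// inside a column-major matrix.
struct SubmatrixView {
    float* ptr;   // address of the block's top-left element
    int    rows;  // number of rows in the block
    int    cols;  // number of columns in the block
    int    ld;    // leading dimension = number of rows of the PARENT matrix
};

// View of the rows x cols block whose top-left element sits at zero-based
// (row0, col0) of an M x N column-major matrix starting at `base`.
inline SubmatrixView submatrix(float* base, int M, int N,
                               int row0, int col0, int rows, int cols)
{
    assert(row0 >= 0 && col0 >= 0 && rows >= 0 && cols >= 0);
    assert(row0 + rows <= M && col0 + cols <= N);
    return { base + row0 + static_cast<std::ptrdiff_t>(M) * col0, rows, cols, M };
}

// Element (i, j) of the block; valid for host memory only.
inline float& at(const SubmatrixView& v, int i, int j)
{
    return v.ptr[i + static_cast<std::ptrdiff_t>(v.ld) * j];
}

// Block (bi, bj) of a partition given by split points, e.g. rowSplit =
// {0, r1, r2, M} and colSplit = {0, c1, c2, N} for a 3 x 3 partition.
// Changing the split points between steps needs no allocation or copy.
inline SubmatrixView partition_block(float* base, int M, int N,
                                     const std::vector<int>& rowSplit,
                                     const std::vector<int>& colSplit,
                                     int bi, int bj)
{
    return submatrix(base, M, N, rowSplit[bi], colSplit[bj],
                     rowSplit[bi + 1] - rowSplit[bi],
                     colSplit[bj + 1] - colSplit[bj]);
}
```

In a cuBLAS call, `ptr` would be passed as the matrix argument and `ld` as the corresponding leading dimension, so the full matrix can stay allocated on the device for the whole loop.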

score 3 · Answer

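talonmies has already answered this question satisfactorily. To support his answer, and in the hope that it is useful to other users, I provide here a complete worked example showing how to use cublas<t>gemm to multiply submatrices of the full matrices A and B and assign the result to a submatrix of the full matrix C.

Although the question concerns Fortran, the code below is given in C/C++, since I do not use Fortran together with CUDA and many users use CUDA with C/C++.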

The code makes use of

pointer arithmetic to access the submatrices;
the concepts of leading dimension and submatrix dimensions.
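To make the role of the leading dimension concrete, here is a host-only reference sketch. It is not the cuBLAS implementation: `gemm_ref` is a hypothetical function that mirrors the argument meaning of a non-transposed `cublasSgemm` call, where `m`, `n`, `k` are the submatrix dimensions and `lda`, `ldb`, `ldc` are the row counts of the full column-major matrices the pointers point into.

```cpp
#include <cassert>
#include <vector>

// Host reference for C = alpha * A * B + beta * C on column-major blocks.
//   A is m x k, stored with leading dimension lda
//   B is k x n, stored with leading dimension ldb
//   C is m x n, stored with leading dimension ldc
// Each pointer addresses the top-left element of its block.
void gemm_ref(int m, int n, int k, float alpha,
              const float* A, int lda,
              const float* B, int ldb,
              float beta, float* C, int ldc)
{
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.f;
            for (int p = 0; p < k; ++p)
                acc += A[i + lda * p] * B[p + ldb * j];   // stride lda/ldb between columns
            float& c = C[i + ldc * j];
            // With beta == 0 the old value of C is not read, as in BLAS.
            c = (beta == 0.f) ? alpha * acc : alpha * acc + beta * c;
        }
    }
}
```

Only the m x n block addressed through `C` and `ldc` is written; every other element of the full matrix keeps its previous value.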

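The code below considers three matrices:

A - 10 x 9;
B - 15 x 13;
C - 10 x 12.

Matrix C is initialized to all 10s. In Matlab notation, the code performs the following submatrix multiplication: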

C(1+x3:5+x3,1+y3:3+y3) = A(1+x1:5+x1,1+y1:4+y1) * B(1+x2:4+x2,1+y2:3+y2);
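The Matlab ranges are 1-based and inclusive, while the pointer offsets passed to cuBLAS are 0-based. The conversion can be sketched as below; `MatlabRange`, `Extent`, `to_extent` and `block_offset` are hypothetical names used only for this illustration.

```cpp
#include <cassert>

// Hypothetical helper types, for illustration only.
struct MatlabRange { int first; int last; };    // 1-based, inclusive (first:last)
struct Extent      { int offset; int count; };  // 0-based start and length

// Converts a Matlab range first:last into a 0-based offset and a length.
inline Extent to_extent(MatlabRange r)
{
    assert(r.first >= 1 && r.last >= r.first);
    return { r.first - 1, r.last - r.first + 1 };
}

// Offset, in elements, of the block's top-left element inside a
// column-major matrix with leading dimension ld.
inline long block_offset(Extent rows, Extent cols, int ld)
{
    return rows.offset + static_cast<long>(ld) * cols.offset;
}
```

For example, with x1 = 3 and y1 = 2, A(1+x1:5+x1,1+y1:4+y1) becomes row offset 3 with 5 rows and column offset 2 with 4 columns, so in a 10-row matrix the block starts 3 + 10*2 = 23 elements after the base pointer.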

The Utilities.cu and Utilities.cuh files, which supply the cublasSafeCall wrapper used below, are maintained separately and omitted here.

#include <cstdio>       // printf
#include <iostream>     // std::cout

#include <thrust/device_vector.h>
#include <thrust/random.h>

#include <cublas_v2.h>

#include "Utilities.cuh"

/********/
/* MAIN */
/********/
int main()
{
    /**************************/
    /* SETTING UP THE PROBLEM */
    /**************************/

    //const int Nrows1 = 10;            // --- Number of rows of matrix 1
    //const int Ncols1 = 10;            // --- Number of columns of matrix 1

    //const int Nrows2 = 15;            // --- Number of rows of matrix 2
    //const int Ncols2 = 15;            // --- Number of columns of matrix 2

    //const int Nrows3 = 12;            // --- Number of rows of matrix 3
    //const int Ncols3 = 12;            // --- Number of columns of matrix 3

    const int Nrows1 = 10;          // --- Number of rows of matrix 1
    const int Ncols1 = 9;           // --- Number of columns of matrix 1

    const int Nrows2 = 15;          // --- Number of rows of matrix 2
    const int Ncols2 = 13;          // --- Number of columns of matrix 2

    const int Nrows3 = 10;          // --- Number of rows of matrix 3
    const int Ncols3 = 12;          // --- Number of columns of matrix 3

    const int Nrows = 5;            // --- Number of rows of submatrix 3 = number of rows of submatrix 1
    const int Ncols = 3;            // --- Number of columns of submatrix 3 = number of columns of submatrix 2

    const int Nrowscols = 4;        // --- Number of columns of submatrix 1 = number of rows of submatrix 2

    const int x1 = 3;               // --- Zero-based row offset of submatrix 1 within matrix 1
    const int y1 = 2;               // --- Zero-based column offset of submatrix 1 within matrix 1

    const int x2 = 6;               // --- Zero-based row offset of submatrix 2 within matrix 2
    const int y2 = 4;               // --- Zero-based column offset of submatrix 2 within matrix 2

    const int x3 = 3;               // --- Zero-based row offset of submatrix 3 within matrix 3
    const int y3 = 5;               // --- Zero-based column offset of submatrix 3 within matrix 3

    // --- Random uniform integer distribution between 0 and 20
    thrust::default_random_engine rng;
    thrust::uniform_int_distribution<int> dist(0, 20);

    // --- Matrix allocation and initialization
    thrust::device_vector<float> d_matrix1(Nrows1 * Ncols1);
    thrust::device_vector<float> d_matrix2(Nrows2 * Ncols2);
    for (size_t i = 0; i < d_matrix1.size(); i++) d_matrix1[i] = (float)dist(rng);
    for (size_t i = 0; i < d_matrix2.size(); i++) d_matrix2[i] = (float)dist(rng);

    printf("\n\nOriginal full size matrix A\n");
    for(int i = 0; i < Nrows1; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols1; j++) 
            std::cout << d_matrix1[j * Nrows1 + i] << " ";
        std::cout << "]\n";
    }

    printf("\n\nOriginal full size matrix B\n");
    for(int i = 0; i < Nrows2; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols2; j++) 
            std::cout << d_matrix2[j * Nrows2 + i] << " ";
        std::cout << "]\n";
    }

    /*************************/
    /* MATRIX MULTIPLICATION */
    /*************************/
    cublasHandle_t handle;

    cublasSafeCall(cublasCreate(&handle));

    thrust::device_vector<float> d_matrix3(Nrows3 * Ncols3, 10.f);

    float alpha = 1.f;
    float beta  = 0.f;
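    // --- Submatrix pointers: base + row offset + (rows of the full matrix) * column offset
    const float *d_A = thrust::raw_pointer_cast(d_matrix1.data()) + x1 + Nrows1 * y1;
    const float *d_B = thrust::raw_pointer_cast(d_matrix2.data()) + x2 + Nrows2 * y2;
    float       *d_C = thrust::raw_pointer_cast(d_matrix3.data()) + x3 + Nrows3 * y3;

    // --- The leading dimensions are the row counts of the full matrices
    cublasSafeCall(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, Nrows, Ncols, Nrowscols, &alpha,
                   d_A, Nrows1, d_B, Nrows2, &beta, d_C, Nrows3));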

    printf("\n\nResult full size matrix C\n");
    for(int i = 0; i < Nrows3; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols3; j++) 
            std::cout << d_matrix3[j * Nrows3 + i] << " ";
        std::cout << "]\n";
    }

    // --- Release the cuBLAS handle
    cublasSafeCall(cublasDestroy(handle));

    return 0;
}
