c++ - Improving the solution of sparse linear systems

Question

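I have written a C++ code on Linux that solves the linear system A x = b, where A is a sparse symmetric matrix, using the following two approaches:

Using UMFPACK to sequentially factorize and to perform the backward/forward substitution.
Using UMFPACK to sequentially factorize, followed by the backward/forward substitution performed with the cuSPARSE library.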
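
For reference, the backward/forward substitution that both approaches perform after the factorization can be sketched on the host as follows. This is only a minimal dense illustration, assuming a unit lower triangular L and an upper triangular U stored row-major; it leaves out the row and column permutations P and Q that UMFPACK applies, and the function names are hypothetical (they belong to neither UMFPACK nor cuSPARSE).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Solve L * y = b, where L is unit lower triangular (diagonal implied to be 1),
// stored row-major as an n x n dense array. Hypothetical helper for illustration.
std::vector<double> forwardSubstitution(const std::vector<double>& L,
                                        const std::vector<double>& b) {
    const std::size_t n = b.size();
    assert(L.size() == n * n);
    std::vector<double> y(n);
    for (std::size_t i = 0; i < n; ++i) {
        double s = b[i];
        for (std::size_t j = 0; j < i; ++j) s -= L[i * n + j] * y[j];
        y[i] = s;  // unit diagonal: no division needed
    }
    return y;
}

// Solve U * x = y, where U is upper triangular with a nonzero diagonal.
std::vector<double> backwardSubstitution(const std::vector<double>& U,
                                         const std::vector<double>& y) {
    const std::size_t n = y.size();
    assert(U.size() == n * n);
    std::vector<double> x(n);
    for (std::size_t i = n; i-- > 0;) {
        double s = y[i];
        for (std::size_t j = i + 1; j < n; ++j) s -= U[i * n + j] * x[j];
        x[i] = s / U[i * n + i];
    }
    return x;
}
```

Note that each unknown depends on the ones computed before it, so a triangular solve exposes only as much parallelism as the sparsity pattern allows.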

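My system configuration is: CUDA version 5.0, UMFPACK version 5.6.2, Linux kernel version Debian 3.2.46-1, graphics card: GeForce GTX Titan.

In theory, the second approach should perform better than the first, with minimal or no error. However, I am observing the following problems:

The backward/forward substitution using the UMFPACK function umfpack_di_solve is almost 2x faster than the CUDA variant.
For some matrices, the error between the results obtained with UMFPACK and with CUDA is quite large, with a maximum error of 3.2537, whereas for other matrices it is on the order of 1e-16.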
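
The question does not say how the maximum and average errors are computed. A minimal sketch, assuming they are the element-wise absolute differences between the two solution vectors, could look like this (the struct and function names are hypothetical):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Maximum and mean absolute difference between two solution vectors.
struct ErrorStats {
    double maxAbs;
    double meanAbs;
};

// Assumption: "error" means |ref[i] - test[i]|, with ref being the UMFPACK
// result and test being the CUDA result.
ErrorStats compareSolutions(const std::vector<double>& ref,
                            const std::vector<double>& test) {
    assert(ref.size() == test.size() && !ref.empty());
    double maxAbs = 0.0;
    double sum = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double e = std::fabs(ref[i] - test[i]);
        maxAbs = std::max(maxAbs, e);
        sum += e;
    }
    return {maxAbs, sum / static_cast<double>(ref.size())};
}
```

An absolute error of order 1 against values of order 1e-16 for other matrices points to a logic error rather than to floating-point round-off.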

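Attached is my tar file, which contains the following components:

A folder factorize_copy with the main file fc.cu that I use to solve the linear system. It reads the sparse matrices from the grid_*_CSC.m files, which are located in the same directory. For convenience, the results for the three provided sparse matrices are also given in text files.
A folder with all the dependencies needed to compile and run UMFPACK (which we also use for the computations).

The link to the tar file is https://www.dropbox.com/s/9qfs5awclshyk3b/code.tar.gz

If you wish to run the code, I have provided the MAKEFILE I use on my system in the factorize_copy directory. You may need to recompile the UMFPACK libraries.

A sample output of our program for a 586 x 586 sparse matrix is also shown below (note that the error in this case is very high compared with the other sparse matrices we checked).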

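***** Reading the grid

    Reading the grid successful

***** Solving a sparse matrix of size 586x586

***** Solving the grid on umfpack

***** Factorizing the grid

-------------- CPU TIME for umfpack factorization is: 0.00109107

-------------- Wall-clock time for umfpack factorization is: 0

    Factorizing the grid successful

    Solving the grid on umfpack successful

-------------- CPU TIME for umfpack solve is: 6.281e-05

***** Allocating GPU memory and copying data

---------------- CPU time for allocating GPU memory and copying data: 1.6

***** Performing b = P*b to account for the row ordering in A

    Matrix-vector (Pb) multiplication successful

***** Solving the system: LUx=b

    Analyzing Ly = b successful

    Solving Ly = b successful

    Analyzing Ux = y successful

    Solving Ux = y successful

***** Performing x = Q*x to account for the column ordering in A

    Matrix-vector (Qx) multiplication successful

---------- GPU time for the solve is: 5.68029 ms

##### Maximum error between UMFPACK and CUDA: 3.2537

##### Average error between UMFPACK and CUDA: 0.699926

***** Writing the results to output files

    Results written to file "vout_586.m" and file "vout_umfpack_586.m"

(Operation successful!)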

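I would really appreciate it if anyone could point out what the possible error could be in this case. If there is a better way of solving sparse linear systems with CUDA that I am missing, please let me know.

EDIT: I figured out why the result was wrong in some cases and not in others. I had a mistake in the number of threads per block when calling a kernel function in the code. However, I still have the issue of not obtaining a speed-up.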
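
The faulty launch itself is not shown in the question, so the following is only a sketch of the usual way to size a 1D grid so that every element is covered. The helper name iDivUp mirrors the one used in the answer's code further down, but its definition there is not shown, so this version is an assumption; uncoveredElements is a hypothetical helper added for illustration.

```cpp
#include <cassert>

// Ceiling division: the number of blocks of size b needed to cover a elements.
constexpr int iDivUp(int a, int b) { return (a + b - 1) / b; }

// Number of elements that no thread would process for a given launch shape.
constexpr int uncoveredElements(int n, int blocks, int threadsPerBlock) {
    return (n > blocks * threadsPerBlock) ? n - blocks * threadsPerBlock : 0;
}
```

For n = 586 and 256 threads per block, truncating division gives 2 blocks and leaves 74 entries untouched, whereas iDivUp(586, 256) gives 3 blocks. The kernel then has to guard against out-of-range indices with a check such as tid < n.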

score 4 · Accepted Answer

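If you are dealing with a problem that takes sub-millisecond time on the CPU, you can hardly expect the GPU to perform faster, considering all the latencies involved in GPU computing.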
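
To put numbers on this, the sample output in the question reports 6.281e-05 for the UMFPACK solve and 5.68029 ms for the GPU solve. Assuming the first figure is in seconds (the log does not state the unit) and that the two timers measure comparable things, the comparison can be sketched as follows; both function names are hypothetical.

```cpp
#include <cassert>

// Ratio of GPU solve time to CPU solve time (both in seconds).
constexpr double slowdown(double gpuSec, double cpuSec) { return gpuSec / cpuSec; }

// The GPU path only wins if its fixed overhead (kernel launches, analysis,
// synchronization) plus its compute time is below the CPU time.
constexpr bool gpuIsFaster(double cpuSec, double gpuOverheadSec, double gpuComputeSec) {
    return gpuOverheadSec + gpuComputeSec < cpuSec;
}
```

Under these assumptions slowdown(5.68029e-3, 6.281e-05) is roughly 90, and any fixed overhead above about 63 microseconds already makes the GPU path slower for this matrix, no matter how fast the kernels themselves are.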

score 1 · Accepted Answer

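This post considers the very important problem of the fast solution of sparse linear systems.

As of November 2015, the cuSPARSE library offers routines for the solution of sparse linear systems based on (incomplete) LU decomposition, in particular

cusparse<t>csrilu02

and

cusparse<t>csrsv2_solve

In addition, cuSPARSE provides

cusparse<t>csrcolor

which implements graph coloring. The use of graph coloring in incomplete LU factorization is discussed in

Graph Coloring: More Parallelism for Incomplete-LU Factorization

and

M. Naumov, P. Castonguay, J. Cohen, "Parallel Graph Coloring with Applications to the Incomplete-LU Factorization on the GPU", NVIDIA Research Technical Report, May 2015.

The idea is to apply a graph coloring algorithm to the row-dependency graph associated with the coefficient matrix of the system, and then to reorder the system equations accordingly, so that the LU factorization routine can extract more parallelism.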
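
To make the idea concrete, here is a minimal host-side sketch. It uses a simple sequential greedy coloring, which is not the parallel algorithm implemented by cusparse<t>csrcolor, and it treats the off-diagonal nonzeros of the CSR pattern as undirected edges, which is an assumption made for illustration. Both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Greedy sequential coloring of the graph whose edges are the off-diagonal
// nonzeros of an n x n zero-based CSR pattern.
std::vector<int> greedyColoring(const std::vector<int>& rowPtr,
                                const std::vector<int>& colInd) {
    const int n = static_cast<int>(rowPtr.size()) - 1;
    // Symmetrize: i and j are neighbors if A(i,j) or A(j,i) is stored.
    std::vector<std::vector<int>> adj(n);
    for (int i = 0; i < n; ++i)
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k) {
            const int j = colInd[k];
            if (j != i) { adj[i].push_back(j); adj[j].push_back(i); }
        }
    std::vector<int> color(n, -1);
    for (int i = 0; i < n; ++i) {
        std::vector<bool> used(n, false);
        for (int j : adj[i])
            if (color[j] >= 0) used[color[j]] = true;
        int c = 0;
        while (used[c]) ++c;   // smallest color not taken by a neighbor
        color[i] = c;
    }
    return color;
}

// Permutation listing the rows grouped by color (stable within one color).
std::vector<int> reorderingByColor(const std::vector<int>& color) {
    std::vector<int> perm(color.size());
    std::iota(perm.begin(), perm.end(), 0);
    std::stable_sort(perm.begin(), perm.end(),
                     [&](int a, int b) { return color[a] < color[b]; });
    return perm;
}
```

Rows that share a color have no edges between them, so after the reordering they do not depend on one another and can be processed concurrently.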

Below is a full example that uses the above ideas:

#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <assert.h>

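// --- Helper header of the answer's author (not shown here); presumably it provides gpuErrchk, cusparseSafeCall and iDivUp
#include "Utilities.cuh"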

#include <cuda_runtime.h>
#include <cusparse_v2.h>

#define BLOCKSIZE   256

/**************************/
/* SETTING UP THE PROBLEM */
/**************************/
void setUpTheProblem(double **h_A_dense, double **h_x_dense, double **d_A_dense, double **d_x_dense, const int N) {

    // --- Host side dense matrix
    // --- Element size must be that of a double, not of a double pointer
    h_A_dense[0] = (double*)calloc(N * N, sizeof(**h_A_dense));

    // --- Column-major ordering
    h_A_dense[0][0] = 0.4612;   h_A_dense[0][4] = -0.0006;  h_A_dense[0][8]  = 0.0;     h_A_dense[0][12] = 0.0;
    h_A_dense[0][1] = -0.0006;  h_A_dense[0][5] = 0.0;      h_A_dense[0][9]  = 0.0723;  h_A_dense[0][13] = 0.04;
    h_A_dense[0][2] = 0.3566;   h_A_dense[0][6] = 0.0723;   h_A_dense[0][10] = 0.0;     h_A_dense[0][14] = 0.0;
    h_A_dense[0][3] = 0.0;      h_A_dense[0][7] = 0.0;      h_A_dense[0][11] = 1.0;     h_A_dense[0][15] = 0.1;

    h_x_dense[0]    = (double *)malloc(N * sizeof(double)); 
    h_x_dense[0][0] = 100.0;  h_x_dense[0][1] = 200.0; h_x_dense[0][2] = 400.0; h_x_dense[0][3] = 500.0;

    // --- Create device arrays and copy host arrays to them
    gpuErrchk(cudaMalloc(&d_A_dense[0], N * N * sizeof(double)));
    gpuErrchk(cudaMemcpy(d_A_dense[0], h_A_dense[0], N * N * sizeof(double), cudaMemcpyHostToDevice));

    gpuErrchk(cudaMalloc(&d_x_dense[0], N * sizeof(double)));   
    gpuErrchk(cudaMemcpy(d_x_dense[0], h_x_dense[0], N * sizeof(double), cudaMemcpyHostToDevice));
}

/************************/
/* FROM DENSE TO SPARSE */
/************************/
void fromDenseToSparse(const cusparseHandle_t handle, double *d_A_dense, double **d_A, int **d_A_RowIndices, int **d_A_ColIndices, int *nnz, 
                       cusparseMatDescr_t *descrA, const int N) {

    cusparseSafeCall(cusparseCreateMatDescr(&descrA[0]));
    cusparseSafeCall(cusparseSetMatType     (descrA[0], CUSPARSE_MATRIX_TYPE_GENERAL));
    cusparseSafeCall(cusparseSetMatIndexBase(descrA[0], CUSPARSE_INDEX_BASE_ZERO));  

    nnz[0] = 0;                             // --- Number of nonzero elements in dense matrix
    const int lda = N;                      // --- Leading dimension of dense matrix

    // --- Device side number of nonzero elements per row
    int *d_nnzPerVector;    gpuErrchk(cudaMalloc(&d_nnzPerVector, N * sizeof(int)));
    cusparseSafeCall(cusparseDnnz(handle, CUSPARSE_DIRECTION_ROW, N, N, descrA[0], d_A_dense, lda, d_nnzPerVector, &nnz[0]));

    // --- Host side number of nonzero elements per row
    int *h_nnzPerVector = (int *)malloc(N * sizeof(int));
    gpuErrchk(cudaMemcpy(h_nnzPerVector, d_nnzPerVector, N * sizeof(int), cudaMemcpyDeviceToHost));

    printf("Number of nonzero elements in dense matrix = %i\n\n", nnz[0]);
    for (int i = 0; i < N; ++i) printf("Number of nonzero elements in row %i = %i \n", i, h_nnzPerVector[i]);
    printf("\n");

    // --- Device side sparse matrix
    gpuErrchk(cudaMalloc(&d_A[0], nnz[0] * sizeof(double)));

    gpuErrchk(cudaMalloc(&d_A_RowIndices[0], (N + 1) * sizeof(int)));
    gpuErrchk(cudaMalloc(&d_A_ColIndices[0], nnz[0]  * sizeof(int)));

    cusparseSafeCall(cusparseDdense2csr(handle, N, N, descrA[0], d_A_dense, lda, d_nnzPerVector, d_A[0], d_A_RowIndices[0], d_A_ColIndices[0]));

    // --- Host side sparse matrix
    double *h_A = (double *)malloc(nnz[0] * sizeof(double));        
    int *h_A_RowIndices = (int *)malloc((N + 1) * sizeof(*h_A_RowIndices));
    int *h_A_ColIndices = (int *)malloc(nnz[0] * sizeof(*h_A_ColIndices));
    gpuErrchk(cudaMemcpy(h_A, d_A[0], nnz[0] * sizeof(double), cudaMemcpyDeviceToHost));
    gpuErrchk(cudaMemcpy(h_A_RowIndices, d_A_RowIndices[0], (N + 1) * sizeof(int), cudaMemcpyDeviceToHost));
    gpuErrchk(cudaMemcpy(h_A_ColIndices, d_A_ColIndices[0], nnz[0] * sizeof(int), cudaMemcpyDeviceToHost));

    printf("\nOriginal matrix in CSR format\n\n");
    for (int i = 0; i < nnz[0]; ++i) printf("A[%i] = %f ", i, h_A[i]); printf("\n");

    printf("\n");
    for (int i = 0; i < (N + 1); ++i) printf("h_A_RowIndices[%i] = %i \n", i, h_A_RowIndices[i]); printf("\n");

    for (int i = 0; i < nnz[0]; ++i) printf("h_A_ColIndices[%i] = %i \n", i, h_A_ColIndices[i]);    

}

/******************/
/* GRAPH COLORING */
/******************/
__global__ void setRowIndices(int *d_B_RowIndices, const int N) {

    const int tid = threadIdx.x + blockDim.x * blockIdx.x;

    if (tid == N)       d_B_RowIndices[tid] = N;
    else if (tid < N)   d_B_RowIndices[tid] = tid;

}

__global__ void setB(double *d_B, const int N) {

    const int tid = threadIdx.x + blockDim.x * blockIdx.x;

    if (tid < N)    d_B[tid] = 1.0;

}

void graphColoring(const cusparseHandle_t handle, const int nnz, const cusparseMatDescr_t descrA, const double fractionToColor, double *d_A, 
                   const int *d_A_RowIndices, const int *d_A_ColIndices, double **d_B, int **d_B_RowIndices, int **d_B_ColIndices, 
                   cusparseMatDescr_t *descrB, const int N) {

    cusparseColorInfo_t info;       cusparseSafeCall(cusparseCreateColorInfo(&info));

    int ncolors;
    // --- The coloring and reordering arrays hold ints, so they are sized with sizeof(int)
    int *d_coloring;        gpuErrchk(cudaMalloc(&d_coloring, N * sizeof(int)));
    gpuErrchk(cudaMalloc(&d_B_ColIndices[0], N * sizeof(int)));
    cusparseSafeCall(cusparseDcsrcolor(handle, N, nnz, descrA, d_A, d_A_RowIndices, d_A_ColIndices, &fractionToColor, &ncolors, d_coloring,
                                       d_B_ColIndices[0], info));

    int *h_coloring     = (int *)malloc(N * sizeof(int));
    int *h_B_ColIndices = (int *)malloc(N * sizeof(int));
    gpuErrchk(cudaMemcpy(h_coloring, d_coloring, N * sizeof(int), cudaMemcpyDeviceToHost));
    gpuErrchk(cudaMemcpy(h_B_ColIndices, d_B_ColIndices[0], N * sizeof(int), cudaMemcpyDeviceToHost));

    for (int i = 0; i < N; i++) printf("h_coloring = %i; h_B_ColIndices = %i\n", h_coloring[i], h_B_ColIndices[i]);

    gpuErrchk(cudaMalloc(&d_B_RowIndices[0], (N + 1) * sizeof(int)));
    int *h_B_RowIndices = (int *)malloc((N + 1) * sizeof(int));
    setRowIndices<<<iDivUp(N + 1, BLOCKSIZE), BLOCKSIZE>>>(d_B_RowIndices[0], N);

    gpuErrchk(cudaMemcpy(h_B_RowIndices, d_B_RowIndices[0], (N + 1) * sizeof(int), cudaMemcpyDeviceToHost));
    printf("\n"); for (int i = 0; i <= N; i++) printf("h_B_RowIndices = %i\n", h_B_RowIndices[i]);

    gpuErrchk(cudaMalloc(&d_B[0], N * sizeof(double)));
    double *h_B = (double *)malloc(N * sizeof(double));
    setB<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_B[0], N);

    gpuErrchk(cudaMemcpy(h_B, d_B[0], N * sizeof(double), cudaMemcpyDeviceToHost));
    printf("\n"); for (int i = 0; i < N; i++) printf("h_B = %f\n", h_B[i]);

    // --- Descriptor for sparse permutation matrix B
    cusparseSafeCall(cusparseCreateMatDescr(&descrB[0]));
    cusparseSafeCall(cusparseSetMatType     (descrB[0], CUSPARSE_MATRIX_TYPE_GENERAL));
    cusparseSafeCall(cusparseSetMatIndexBase(descrB[0], CUSPARSE_INDEX_BASE_ZERO));  
}

/*************************/
/* MATRIX ROW REORDERING */
/*************************/
void matrixRowReordering(const cusparseHandle_t handle, int nnzA, int nnzB, int *nnzC, cusparseMatDescr_t descrA, cusparseMatDescr_t descrB, 
                         cusparseMatDescr_t *descrC, double *d_A, int *d_A_RowIndices, int *d_A_ColIndices, double *d_B, int *d_B_RowIndices, 
                         int *d_B_ColIndices, double **d_C, int **d_C_RowIndices, int **d_C_ColIndices, const int N) {

    // --- Descriptor for sparse matrix C
    cusparseSafeCall(cusparseCreateMatDescr(&descrC[0]));
    cusparseSafeCall(cusparseSetMatType     (descrC[0], CUSPARSE_MATRIX_TYPE_GENERAL));
    cusparseSafeCall(cusparseSetMatIndexBase(descrC[0], CUSPARSE_INDEX_BASE_ZERO));  

    const int lda = N;                      // --- Leading dimension of dense matrix

    // --- Device side sparse matrix
    gpuErrchk(cudaMalloc(&d_C_RowIndices[0], (N + 1) * sizeof(int)));

    // --- Host side sparse matrices
    int *h_C_RowIndices = (int *)malloc((N + 1) * sizeof(int));

    // --- Performing the matrix - matrix multiplication
    int baseC;
    int *nnzTotalDevHostPtr = &nnzC[0]; 

    cusparseSafeCall(cusparseSetPointerMode(handle, CUSPARSE_POINTER_MODE_HOST));

    cusparseSafeCall(cusparseXcsrgemmNnz(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, N, descrB, nnzB, 
                                         d_B_RowIndices, d_B_ColIndices, descrA, nnzA, d_A_RowIndices, d_A_ColIndices, descrC[0], d_C_RowIndices[0], 
                                         nnzTotalDevHostPtr));
    if (NULL != nnzTotalDevHostPtr) nnzC[0] = *nnzTotalDevHostPtr;
    else {
        // --- Fallback: recover nnz(C) from the row-pointer array on the device
        gpuErrchk(cudaMemcpy(&nnzC[0],  d_C_RowIndices[0] + N, sizeof(int), cudaMemcpyDeviceToHost));
        gpuErrchk(cudaMemcpy(&baseC,    d_C_RowIndices[0],     sizeof(int), cudaMemcpyDeviceToHost));
        nnzC[0] -= baseC;
    }
    gpuErrchk(cudaMalloc(&d_C_ColIndices[0], nnzC[0] * sizeof(int)));
    gpuErrchk(cudaMalloc(&d_C[0], nnzC[0] * sizeof(double)));
    double *h_C = (double *)malloc(nnzC[0] * sizeof(double));       
    int *h_C_ColIndices = (int *)malloc(nnzC[0] * sizeof(int));
    cusparseSafeCall(cusparseDcsrgemm(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, N, descrB, nnzB,
                                      d_B, d_B_RowIndices, d_B_ColIndices, descrA, nnzA, d_A, d_A_RowIndices, d_A_ColIndices, descrC[0],
                                      d_C[0], d_C_RowIndices[0], d_C_ColIndices[0]));

    double *h_C_dense = (double*)malloc(N * N * sizeof(double));
    double *d_C_dense;  gpuErrchk(cudaMalloc(&d_C_dense, N * N * sizeof(double)));
    cusparseSafeCall(cusparseDcsr2dense(handle, N, N, descrC[0], d_C[0], d_C_RowIndices[0], d_C_ColIndices[0], d_C_dense, N));

    gpuErrchk(cudaMemcpy(h_C ,           d_C[0],            nnzC[0] * sizeof(double), cudaMemcpyDeviceToHost));
    gpuErrchk(cudaMemcpy(h_C_RowIndices, d_C_RowIndices[0], (N + 1) * sizeof(int), cudaMemcpyDeviceToHost));
    gpuErrchk(cudaMemcpy(h_C_ColIndices, d_C_ColIndices[0], nnzC[0] * sizeof(int), cudaMemcpyDeviceToHost));

    printf("\nResult matrix C in CSR format\n\n");
    for (int i = 0; i < nnzC[0]; ++i) printf("C[%i] = %f ", i, h_C[i]); printf("\n");

    printf("\n");
    for (int i = 0; i < (N + 1); ++i) printf("h_C_RowIndices[%i] = %i \n", i, h_C_RowIndices[i]); printf("\n");

    printf("\n");
    for (int i = 0; i < nnzC[0]; ++i) printf("h_C_ColIndices[%i] = %i \n", i, h_C_ColIndices[i]);   

    gpuErrchk(cudaMemcpy(h_C_dense, d_C_dense, N * N * sizeof(double), cudaMemcpyDeviceToHost));

    for (int j = 0; j < N; j++) {
        for (int i = 0; i < N; i++) 
            printf("%f \t", h_C_dense[i * N + j]);
        printf("\n");
        }

}

/******************/
/* ROW REORDERING */
/******************/
void rowReordering(const cusparseHandle_t handle, int nnzA, cusparseMatDescr_t descrB, double *d_B, int *d_B_RowIndices, int *d_B_ColIndices, 
                   double *d_x_dense, double **d_y_dense, const int N) {

    gpuErrchk(cudaMalloc(&d_y_dense[0], N     * sizeof(double)));

    const double alpha = 1.;
    const double beta  = 0.;
    cusparseSafeCall(cusparseDcsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, nnzA, &alpha, descrB, d_B, d_B_RowIndices, d_B_ColIndices, d_x_dense, 
                                    &beta, d_y_dense[0]));

    double *h_y_dense = (double*)malloc(N *     sizeof(double));
    gpuErrchk(cudaMemcpy(h_y_dense,           d_y_dense[0],            N * sizeof(double), cudaMemcpyDeviceToHost));

    printf("\nResult vector\n\n");
    for (int i = 0; i < N; ++i) printf("h_y[%i] = %f ", i, h_y_dense[i]); printf("\n");

}

/*****************************/
/* SOLVING THE LINEAR SYSTEM */
/*****************************/
void LUDecomposition(const cusparseHandle_t handle, int nnzC, cusparseMatDescr_t descrC, double *d_C, int *d_C_RowIndices, int *d_C_ColIndices, 
                     double *d_x_dense, double **d_y_dense, const int N) {

    /******************************************/
    /* STEP 1: CREATE DESCRIPTORS FOR L AND U */
    /******************************************/
    cusparseMatDescr_t      descr_L = 0; 
    cusparseSafeCall(cusparseCreateMatDescr (&descr_L)); 
    cusparseSafeCall(cusparseSetMatIndexBase(descr_L, CUSPARSE_INDEX_BASE_ZERO)); 
    cusparseSafeCall(cusparseSetMatType     (descr_L, CUSPARSE_MATRIX_TYPE_GENERAL)); 
    cusparseSafeCall(cusparseSetMatFillMode (descr_L, CUSPARSE_FILL_MODE_LOWER)); 
    cusparseSafeCall(cusparseSetMatDiagType (descr_L, CUSPARSE_DIAG_TYPE_UNIT)); 

    cusparseMatDescr_t      descr_U = 0; 
    cusparseSafeCall(cusparseCreateMatDescr (&descr_U)); 
    cusparseSafeCall(cusparseSetMatIndexBase(descr_U, CUSPARSE_INDEX_BASE_ZERO)); 
    cusparseSafeCall(cusparseSetMatType     (descr_U, CUSPARSE_MATRIX_TYPE_GENERAL)); 
    cusparseSafeCall(cusparseSetMatFillMode (descr_U, CUSPARSE_FILL_MODE_UPPER)); 
    cusparseSafeCall(cusparseSetMatDiagType (descr_U, CUSPARSE_DIAG_TYPE_NON_UNIT)); 

    /**************************************************************************************************/
    /* STEP 2: QUERY HOW MUCH MEMORY USED IN LU FACTORIZATION AND THE TWO FOLLOWING SYSTEM INVERSIONS */
    /**************************************************************************************************/
    csrilu02Info_t info_C = 0; cusparseSafeCall(cusparseCreateCsrilu02Info  (&info_C)); 
    csrsv2Info_t info_L = 0;   cusparseSafeCall(cusparseCreateCsrsv2Info    (&info_L)); 
    csrsv2Info_t info_U = 0;   cusparseSafeCall(cusparseCreateCsrsv2Info    (&info_U)); 

    int pBufferSize_M, pBufferSize_L, pBufferSize_U; 
    cusparseSafeCall(cusparseDcsrilu02_bufferSize(handle, N, nnzC, descrC, d_C, d_C_RowIndices, d_C_ColIndices, info_C, &pBufferSize_M)); 
    cusparseSafeCall(cusparseDcsrsv2_bufferSize (handle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, nnzC, descr_L, d_C, d_C_RowIndices, d_C_ColIndices, info_L, &pBufferSize_L)); 
    cusparseSafeCall(cusparseDcsrsv2_bufferSize (handle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, nnzC, descr_U, d_C, d_C_RowIndices, d_C_ColIndices, info_U, &pBufferSize_U)); 

    int pBufferSize = max(pBufferSize_M, max(pBufferSize_L, pBufferSize_U)); 
    void *pBuffer = 0; gpuErrchk(cudaMalloc((void**)&pBuffer, pBufferSize)); 

    /************************************************************************************************/
    /* STEP 3: ANALYZE THE THREE PROBLEMS: LU FACTORIZATION AND THE TWO FOLLOWING SYSTEM INVERSIONS */
    /************************************************************************************************/
    int structural_zero; 

    cusparseSafeCall(cusparseDcsrilu02_analysis(handle, N, nnzC, descrC, d_C, d_C_RowIndices, d_C_ColIndices, info_C, CUSPARSE_SOLVE_POLICY_NO_LEVEL, pBuffer)); 
    cusparseStatus_t status = cusparseXcsrilu02_zeroPivot(handle, info_C, &structural_zero); 
    if (CUSPARSE_STATUS_ZERO_PIVOT == status){ printf("A(%d,%d) is missing\n", structural_zero, structural_zero); } 

    cusparseSafeCall(cusparseDcsrsv2_analysis(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, nnzC, descr_L, d_C, d_C_RowIndices, d_C_ColIndices, info_L, CUSPARSE_SOLVE_POLICY_NO_LEVEL, pBuffer)); 
    cusparseSafeCall(cusparseDcsrsv2_analysis(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, nnzC, descr_U, d_C, d_C_RowIndices, d_C_ColIndices, info_U, CUSPARSE_SOLVE_POLICY_USE_LEVEL, pBuffer)); 

    /************************************/
    /* STEP 4: FACTORIZATION: A = L * U */
    /************************************/
    int numerical_zero; 

    cusparseSafeCall(cusparseDcsrilu02(handle, N, nnzC, descrC, d_C, d_C_RowIndices, d_C_ColIndices, info_C, CUSPARSE_SOLVE_POLICY_NO_LEVEL, pBuffer)); 
    status = cusparseXcsrilu02_zeroPivot(handle, info_C, &numerical_zero); 
    if (CUSPARSE_STATUS_ZERO_PIVOT == status){ printf("U(%d,%d) is zero\n", numerical_zero, numerical_zero); } 

    /*********************/
    /* STEP 5: L * z = x */
    /*********************/
    // --- Allocating the intermediate result vector
    double *d_z_dense;      gpuErrchk(cudaMalloc(&d_z_dense, N * sizeof(double))); 

    const double alpha = 1.; 
    cusparseSafeCall(cusparseDcsrsv2_solve(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, nnzC, &alpha, descr_L, d_C, d_C_RowIndices, d_C_ColIndices, info_L, d_x_dense, d_z_dense, CUSPARSE_SOLVE_POLICY_NO_LEVEL, pBuffer)); 

    /*********************/
    /* STEP 6: U * y = z */
    /*********************/
    gpuErrchk(cudaMalloc(&d_y_dense[0], N * sizeof(double))); 
    cusparseSafeCall(cusparseDcsrsv2_solve(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, nnzC, &alpha, descr_U, d_C, d_C_RowIndices, d_C_ColIndices, info_U, d_z_dense, d_y_dense[0], CUSPARSE_SOLVE_POLICY_USE_LEVEL, pBuffer));

    double *h_y_dense = (double *)malloc(N * sizeof(double));
    gpuErrchk(cudaMemcpy(h_y_dense, d_y_dense[0], N * sizeof(double), cudaMemcpyDeviceToHost));
    printf("\n\nFinal result\n");
    for (int k=0; k<N; k++) printf("x[%i] = %f\n", k, h_y_dense[k]);

}

/********/
/* MAIN */
/********/
int main()
{
    // --- Initialize cuSPARSE
    cusparseHandle_t handle;    cusparseSafeCall(cusparseCreate(&handle));

    /*************************************************/
    /* SETTING UP THE ORIGINAL LINEAR SYSTEM PROBLEM */
    /*************************************************/
    const int N     = 4;                // --- Number of rows and columns

    double *h_A_dense;  double *h_x_dense;
    double *d_A_dense;  double *d_x_dense;
    setUpTheProblem(&h_A_dense, &h_x_dense, &d_A_dense, &d_x_dense, N);

    /************************/
    /* FROM DENSE TO SPARSE */
    /************************/
    //--- Descriptor for sparse matrix A
    cusparseMatDescr_t descrA;

    int *d_A_RowIndices, *d_A_ColIndices;   
    double *d_A;

    int nnzA;

    fromDenseToSparse(handle, d_A_dense, &d_A, &d_A_RowIndices, &d_A_ColIndices, &nnzA, &descrA, N);

    /******************/
    /* GRAPH COLORING */
    /******************/
    const double fractionToColor = 0.95;

    int *d_B_RowIndices, *d_B_ColIndices;   
    double *d_B;

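    int nnzB = N;                   // --- The permutation matrix B has exactly one nonzero per row

    cusparseMatDescr_t descrB;      
    // --- The coloring is computed on A, so the number of nonzeros of A is passed here
    graphColoring(handle, nnzA, descrA, fractionToColor, d_A, d_A_RowIndices, d_A_ColIndices, &d_B, &d_B_RowIndices, &d_B_ColIndices, &descrB, N);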

    /*************************/
    /* MATRIX ROW REORDERING */
    /*************************/
    int nnzC;

    int *d_C_RowIndices, *d_C_ColIndices;
    double *d_C;

    cusparseMatDescr_t descrC;
    matrixRowReordering(handle, nnzA, nnzB, &nnzC, descrA, descrB, &descrC, d_A, d_A_RowIndices, d_A_ColIndices, d_B, d_B_RowIndices, d_B_ColIndices, 
                        &d_C, &d_C_RowIndices, &d_C_ColIndices, N);

    /******************/
    /* ROW REORDERING */
    /******************/
    double *d_y_dense;
    // --- The matrix applied here is the permutation matrix B, which has N nonzeros
    rowReordering(handle, N, descrB, d_B, d_B_RowIndices, d_B_ColIndices, d_x_dense, &d_y_dense, N);

    /*****************************/
    /* SOLVING THE LINEAR SYSTEM */
    /*****************************/
    double *d_xsol_dense;
    LUDecomposition(handle, nnzC, descrC, d_C, d_C_RowIndices, d_C_ColIndices, d_y_dense, &d_xsol_dense, N);

}
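
As a cross-check of the dense-to-CSR step, the following host-side sketch builds the zero-based CSR arrays of the 4 x 4 example matrix from the listing above without cuSPARSE. The struct and function names are hypothetical, and the code only illustrates the storage format; it is not what cusparseDdense2csr does internally.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Zero-based CSR storage: values, row pointers (n + 1 entries), column indices.
struct Csr {
    std::vector<double> val;
    std::vector<int> rowPtr;
    std::vector<int> colInd;
};

// Convert an n x n dense matrix stored in column-major order to CSR.
Csr denseColMajorToCsr(const std::vector<double>& a, int n) {
    assert(a.size() == static_cast<std::size_t>(n) * n);
    Csr m;
    m.rowPtr.assign(1, 0);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            const double v = a[j * n + i];  // element (i, j) in column-major storage
            if (v != 0.0) {
                m.val.push_back(v);
                m.colInd.push_back(j);
            }
        }
        m.rowPtr.push_back(static_cast<int>(m.val.size()));
    }
    return m;
}

// The example matrix of the listing, one column per line (column-major order).
std::vector<double> exampleMatrix() {
    return { 0.4612, -0.0006, 0.3566, 0.0,
            -0.0006,  0.0,    0.0723, 0.0,
             0.0,     0.0723, 0.0,    1.0,
             0.0,     0.04,   0.0,    0.1};
}
```

For this matrix the conversion yields 9 nonzeros with row pointers 0, 2, 5, 7, 9. The diagonal entries A(1,1) and A(2,2) are not stored, which is the situation that the structural zero-pivot check after cusparseDcsrilu02_analysis reports.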
