cuda - 使用 CUDA 进行行列式计算

Question

是否有任何库或免费可用的代码可以完全在 GPU 上计算小( )、双精度矩阵的行列式？6x6

score 4 · Accepted Answer

正如 Bart 在上面的评论中特别指出的那样，使用 GPU 来计算小矩阵（甚至其中许多矩阵）的行列式的计算并不能确保优于其他计算平台。

我认为计算矩阵行列式的问题本身是一个有趣的问题，在应用程序中可能会多次出现。目前，我不知道有任何库提供使用 CUDA 进行行列式计算的例程（也cuBLAS没有cuSOLVER这样的例程），因此您有两种可能性：

正如 Pavan 所指出的，实施您自己的方法；
设法使用其他可用的例程。

关于最后一点，一种可能性是使用Cholesky 分解，然后将行列式计算为 Cholesky 矩阵对角元素乘积的平方。在 Matlab 中，它将显示为：

prod(diag(chol(A)))^2

下面，我提供了一个利用这个想法的代码。特别是，Cholesky 分解是通过使用cuSOLVER'potrf函数执行的，而 Cholesky 矩阵对角线上的元素的乘积是使用 Thrust 进行跨步约简的应用。

下面的代码虽然适用于大型矩阵，但对于需要计算大型矩阵的行列式的人来说将立即有用。但是如何让它适应几个小矩阵呢？一种可能性是使用cuSOLVER' 流进行 Cholesky 分解，然后使用 Thurst 1.8 的新动态并行功能。请注意，从 CUDA 7.0 开始，cuSOLVER不允许使用动态并行。

这是代码：

#include "cuda_runtime.h"
#include "device_launch_paraMeters.h"

#include<iostream>
#include<iomanip>
#include<stdlib.h>
#include<stdio.h>
#include<assert.h>
#include<ostream>

#include <cusolverDn.h>
#include <cublas_v2.h>
#include <cuda_runtime_api.h>

#include "Utilities.cuh"

#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/functional.h>

#include <thrust/fill.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

#include <thrust/copy.h>

/*************************/
/* STRIDED RANGE FUNCTOR */
/*************************/
template <typename Iterator>
class strided_range
{
    public:

    typedef typename thrust::iterator_difference<Iterator>::type difference_type;

    struct stride_functor : public thrust::unary_function<difference_type,difference_type>
    {
        difference_type stride;

        stride_functor(difference_type stride)
            : stride(stride) {}

        __host__ __device__
        difference_type operator()(const difference_type& i) const
        {
            return stride * i;
        }
    };

    typedef typename thrust::counting_iterator<difference_type>                   CountingIterator;
    typedef typename thrust::transform_iterator<stride_functor, CountingIterator> TransformIterator;
    typedef typename thrust::permutation_iterator<Iterator,TransformIterator>     PermutationIterator;

    // type of the strided_range iterator
    typedef PermutationIterator iterator;

    // construct strided_range for the range [first,last)
    strided_range(Iterator first, Iterator last, difference_type stride)
        : first(first), last(last), stride(stride) {}

    iterator begin(void) const
    {
        return PermutationIterator(first, TransformIterator(CountingIterator(0), stride_functor(stride)));
    }

    iterator end(void) const
    {
        return begin() + ((last - first) + (stride - 1)) / stride;
    }

    protected:
    Iterator first;
    Iterator last;
    difference_type stride;
};

int main(void)
{
    const int Nrows = 5;
    const int Ncols = 5;

    const int STRIDE = Nrows + 1;

    // --- Setting the host, Nrows x Ncols matrix
    double h_A[Nrows][Ncols] = { 
        { 2.,    -2.,    -2.,    -2.,    -2.,},  
        {-2.,     4.,     0.,     0.,     0.,}, 
        {-2.,     0.,     6.,     2.,     2.,}, 
        {-2.,     0.,     2.,     8.,     4.,}, 
        {-2.,     0.,     2.,     4.,     10.,}
    };

    // --- Setting the device matrix and moving the host matrix to the device
    double *d_A;            gpuErrchk(cudaMalloc(&d_A,      Nrows * Ncols * sizeof(double)));
    gpuErrchk(cudaMemcpy(d_A, h_A, Nrows * Ncols * sizeof(double), cudaMemcpyHostToDevice));

    // --- cuSOLVE input/output parameters/arrays
    int work_size = 0;
    int *devInfo;           gpuErrchk(cudaMalloc(&devInfo,          sizeof(int)));

    // --- CUDA solver initialization
    cusolverDnHandle_t solver_handle;
    cusolverDnCreate(&solver_handle);

    // --- CUDA CHOLESKY initialization
    cusolveSafeCall(cusolverDnDpotrf_bufferSize(solver_handle, CUBLAS_FILL_MODE_LOWER, Nrows, d_A, Nrows, &work_size));

    // --- CUDA POTRF execution
    double *work;   gpuErrchk(cudaMalloc(&work, work_size * sizeof(double)));
    cusolveSafeCall(cusolverDnDpotrf(solver_handle, CUBLAS_FILL_MODE_LOWER, Nrows, d_A, Nrows, work, work_size, devInfo));
    int devInfo_h = 0;  gpuErrchk(cudaMemcpy(&devInfo_h, devInfo, sizeof(int), cudaMemcpyDeviceToHost));
    if (devInfo_h != 0) std::cout   << "Unsuccessful potrf execution\n\n";

    cusolverDnDestroy(solver_handle);

    // --- Strided reduction of the elements of d_A: calculating the product of the diagonal of the Cholesky factorization  
    thrust::device_ptr<double> dev_ptr = thrust::device_pointer_cast(d_A);
    typedef thrust::device_vector<double>::iterator Iterator;
    strided_range<Iterator> pos(dev_ptr, dev_ptr + Nrows * Ncols, STRIDE);

    double det = thrust::reduce(pos.begin(), pos.end(), 1., thrust::multiplies<double>());
    det  = det * det;

    printf("determinant = %f\n", det);

    return 0;
}

score 4 · Accepted Answer

这是计划，您将需要缓冲 100 个这些微小矩阵并启动内核一次以一次计算所有这些矩阵的行列式。

我不会编写实际的代码，但这应该会有所帮助。

1) 启动 # 块 = # 矩阵。每个块计算每个矩阵的行列式。

2) det(A) = det(A11 * A22 - A21 * A12); 其中 A 是 6x6，A11、A12、A21、A22 是 A 的 3x3 子矩阵。

3) 编写一个对 3x3 矩阵进行矩阵乘法的设备函数

4) 3x3 矩阵的 det 计算简单：使用此处的公式。

编辑：显然（2）仅在 A21 * A12 == A12 * A21 时有效

一种替代方法如下

1)对每个 6x6 矩阵通过高斯消元法进行 LU 分解

2) 将 U 的对角元素相乘得到行列式。

score -2 · Accepted Answer

您可以使用 OpenCL 或 CUDA 作为库并编写一个简短的程序（OpenCL 中的内核）来计算 GPU 上的行列式

CUDA http://www.nvidia.de/object/cuda_home_new_de.html

OpenCL http://www.khronos.org/opencl/

http://www.csd.uwo.ca/~moreno/Publications/DetHpcsPaper-proceedings.pdf

本文应包含 CUDA 的伪代码

cuda - 使用 CUDA 进行行列式计算

3 回答 3

Related

Reference