c - 使用 CUDA 减少矩阵行

Question

Windows 7, NVidia GeForce 425M.

我写了一个简单的 CUDA 代码来计算矩阵的行和。矩阵具有一维表示（指向浮点数的指针）。

代码的串行版本如下（它有2循环，正如预期的那样）：

void serial_rowSum (float* m, float* output, int nrow, int ncol) {
    float sum;
    for (int i = 0 ; i < nrow ; i++) {
        sum = 0;
        for (int j = 0 ; j < ncol ; j++)
            sum += m[i*ncol+j];
        output[i] = sum;
    }
}

在 CUDA 代码中，我调用了按行扫描矩阵的核函数。下面是内核调用片段：

dim3 threadsPerBlock((unsigned int) nThreadsPerBlock); // has to be multiple of 32
dim3 blocksPerGrid((unsigned int) ceil(nrow/(float) nThreadsPerBlock)); 

kernel_rowSum<<<blocksPerGrid, threadsPerBlock>>>(d_m, d_output, nrow, ncol);

和执行行的并行求和的内核函数（仍然有1循环）：

__global__ void kernel_rowSum(float *m, float *s, int nrow, int ncol) {

    int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;

    if (rowIdx < nrow) {
        float sum=0;
        for (int k = 0 ; k < ncol ; k++)
            sum+=m[rowIdx*ncol+k];
        s[rowIdx] = sum;            
    }

}

到现在为止还挺好。串行和并行 (CUDA) 结果是相等的。

关键是 CUDA 版本的计算时间几乎是串行版本的两倍，即使我更改了nThreadsPerBlock参数：我测试nThreadsPerBlock了从32到1024（我的卡允许的每个块的最大线程数）。

IMO，矩阵维度足够大，可以证明并行化：90,000 x 1,000.

下面，我报告了使用不同nThreadsPerBlock. msec超过平均100样本报告的时间：

矩阵：nrow = 90000 x ncol = 1000

序列：每个样本经过的平均时间（以毫秒为单位）（100样本）289.18：.

CUDA（32ThreadsPerBlock）：每个样本经过的平均时间（以毫秒为单位）（100样本）497.11：.

CUDA（1024ThreadsPerBlock）：每个样本经过的平均时间（以毫秒为单位）（100样本）699.66：.

以防万一，带有32/的版本1024 nThreadsPerBlock是最快/最慢的版本。

我知道从主机复制到设备时会产生某种开销，反之亦然，但速度缓慢可能是因为我没有实现最快的代码。

由于我远非 CUDA 专家：

我是否为此任务编写了最快的版本？我怎样才能改进我的代码？我可以摆脱内核函数中的循环吗？

任何想法表示赞赏。

编辑 1

尽管我描述了一个标准rowSum，但我对具有值的行的AND/操作感兴趣，例如/ 。也就是说，正如一些评论员所建议的那样，它不允许我利用乘以的列向量技巧。OR(0;1}rowANDrowORcuBLAS1COL

编辑 2

正如用户所建议的，其他用户在这里得到了认可：

忘记尝试编写自己的函数，改用 Thrust 库，魔法就来了。

score 15 · Accepted Answer

既然你提到你需要一般的减少算法，而不是只求和。我将在这里尝试给出 3 种方法。内核方法可能具有最高的性能。推力方法最容易实施。cuBLAS 方法仅适用于 sum 并且具有良好的性能。

内核方法

这是一个非常好的文档，介绍了如何优化标准并行缩减。标准降低可分为2个阶段。

多个线程块，每个减少一部分数据；
一个线程块从阶段 1 的结果减少到最后的 1 元素。

对于您的多重减少（减少垫子行）问题，只有第 1 阶段就足够了。这个想法是每个线程块减少 1 行。有关每个线程块多行或每个多线程块 1 行等进一步考虑，您可以参考@Novak 提供的论文。这可能会进一步提高性能，尤其是对于形状不好的矩阵。

推力法

一般的多重还原可以thrust::reduction_by_key在几分钟内完成。您可以在此处找到一些讨论，使用 CUDA Thrust 确定每个矩阵列中的最小元素及其位置。

但是thrust::reduction_by_key并不假设每一行都具有相同的长度，因此您将获得性能损失。另一篇文章如何以最大性能规范化 CUDA 中的矩阵列？thrust::reduction_by_key给出了行总和与 cuBLAS 方法之间的分析比较。它可能会让您对性能有一个基本的了解。

cuBLAS 方法

矩阵 A 的行/列之和可以看作是矩阵-向量乘法，其中向量的元素都是 1。可以用下面的matlab代码表示。

y = A * ones(size(A,2),1);

其中y是 A 的行的总和。

cuBLAS 库为此操作提供了一个高性能的矩阵向量乘法函数cublas<t>gemv()。

时序结果表明，该例程仅比一次简单地读取 A 的所有元素慢 10~50%，可以看作是该操作的理论性能上限。

score 7 · Accepted Answer

减少矩阵的行可以通过三种方式使用CUDA Thrust来解决（它们可能不是唯一的，但解决这一点超出了范围）。正如同一 OP 所承认的那样，对于此类问题，最好使用 CUDA Thrust。此外，使用cuBLAS的方法也是可能的。

方法 #1 -reduce_by_key

这是此Thrust 示例页面中建议的方法。它包括一个使用make_discard_iterator.

方法 #2 -transform

这是 CUDA Thrust 的 Robert Crovella 建议的方法：reduce_by_key 仅针对数组中的某些值，基于“键”数组中的值。

方法#3 -inclusive_scan_by_key

这是 Eric 在How to normalize matrix columns in CUDA with max performance? 中建议的方法？.

方法#4 -cublas<t>gemv

它用于cuBLAS gemv将相关矩阵乘以1's 列。

完整代码

这是浓缩这两种方法的代码。Utilities.cu和Utilities.cuh文件在此处保留，此处省略。TimingGPU.cu和在这里TimingGPU.cuh维护，也被省略了。

#include <cublas_v2.h>

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/random.h>
#include <thrust/sequence.h>

#include <stdio.h>
#include <iostream>

#include "Utilities.cuh"
#include "TimingGPU.cuh"

// --- Required for approach #2
__device__ float *vals;

/**************************************************************/
/* CONVERT LINEAR INDEX TO ROW INDEX - NEEDED FOR APPROACH #1 */
/**************************************************************/
template <typename T>
struct linear_index_to_row_index : public thrust::unary_function<T,T> {

    T Ncols; // --- Number of columns

    __host__ __device__ linear_index_to_row_index(T Ncols) : Ncols(Ncols) {}

    __host__ __device__ T operator()(T i) { return i / Ncols; }
};

/******************************************/
/* ROW_REDUCTION - NEEDED FOR APPROACH #2 */
/******************************************/
struct row_reduction {

    const int Ncols;    // --- Number of columns

    row_reduction(int _Ncols) : Ncols(_Ncols) {}

    __device__ float operator()(float& x, int& y ) {
        float temp = 0.f;
        for (int i = 0; i<Ncols; i++)
            temp += vals[i + (y*Ncols)];
        return temp;
    }
};

/**************************/
/* NEEDED FOR APPROACH #3 */
/**************************/
template<typename T>
struct MulC: public thrust::unary_function<T, T>
{
    T C;
    __host__ __device__ MulC(T c) : C(c) { }
    __host__ __device__ T operator()(T x) { return x * C; }
};

/********/
/* MAIN */
/********/
int main()
{
    const int Nrows = 5;     // --- Number of rows
    const int Ncols = 8;     // --- Number of columns

    // --- Random uniform integer distribution between 10 and 99
    thrust::default_random_engine rng;
    thrust::uniform_int_distribution<int> dist(10, 99);

    // --- Matrix allocation and initialization
    thrust::device_vector<float> d_matrix(Nrows * Ncols);
    for (size_t i = 0; i < d_matrix.size(); i++) d_matrix[i] = (float)dist(rng);

    TimingGPU timerGPU;

    /***************/
    /* APPROACH #1 */
    /***************/
    timerGPU.StartCounter();
    // --- Allocate space for row sums and indices
    thrust::device_vector<float> d_row_sums(Nrows);
    thrust::device_vector<int> d_row_indices(Nrows);

    // --- Compute row sums by summing values with equal row indices
    //thrust::reduce_by_key(thrust::make_transform_iterator(thrust::counting_iterator<int>(0), linear_index_to_row_index<int>(Ncols)),
    //                    thrust::make_transform_iterator(thrust::counting_iterator<int>(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
    //                    d_matrix.begin(),
    //                    d_row_indices.begin(),
    //                    d_row_sums.begin(),
    //                    thrust::equal_to<int>(),
    //                    thrust::plus<float>());

    thrust::reduce_by_key(
                thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)),
                thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
                d_matrix.begin(),
                thrust::make_discard_iterator(),
                d_row_sums.begin());

    printf("Timing for approach #1 = %f\n", timerGPU.GetCounter());

    // --- Print result
    for(int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_row_sums[i] << "\n";
    }

    /***************/
    /* APPROACH #2 */
    /***************/
    timerGPU.StartCounter();
    thrust::device_vector<float> d_row_sums_2(Nrows, 0);
    float *s_vals = thrust::raw_pointer_cast(&d_matrix[0]);
    gpuErrchk(cudaMemcpyToSymbol(vals, &s_vals, sizeof(float *)));
    thrust::transform(d_row_sums_2.begin(), d_row_sums_2.end(), thrust::counting_iterator<int>(0),  d_row_sums_2.begin(), row_reduction(Ncols));

    printf("Timing for approach #2 = %f\n", timerGPU.GetCounter());

    for(int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_row_sums_2[i] << "\n";
    }

    /***************/
    /* APPROACH #3 */
    /***************/

    timerGPU.StartCounter();
    thrust::device_vector<float> d_row_sums_3(Nrows, 0);
    thrust::device_vector<float> d_temp(Nrows * Ncols);
    thrust::inclusive_scan_by_key(
                thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)),
                thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
                d_matrix.begin(),
                d_temp.begin());
    thrust::copy(
                thrust::make_permutation_iterator(
                        d_temp.begin() + Ncols - 1,
                        thrust::make_transform_iterator(thrust::make_counting_iterator(0), MulC<int>(Ncols))),
    thrust::make_permutation_iterator(
                        d_temp.begin() + Ncols - 1,
                        thrust::make_transform_iterator(thrust::make_counting_iterator(0), MulC<int>(Ncols))) + Nrows,
                d_row_sums_3.begin());

    printf("Timing for approach #3 = %f\n", timerGPU.GetCounter());

    for(int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_row_sums_3[i] << "\n";
    }

    /***************/
    /* APPROACH #4 */
    /***************/
    cublasHandle_t handle;

    timerGPU.StartCounter();
    cublasSafeCall(cublasCreate(&handle));

    thrust::device_vector<float> d_row_sums_4(Nrows);
    thrust::device_vector<float> d_ones(Ncols, 1.f);

    float alpha = 1.f;
    float beta  = 0.f;
    cublasSafeCall(cublasSgemv(handle, CUBLAS_OP_T, Ncols, Nrows, &alpha, thrust::raw_pointer_cast(d_matrix.data()), Ncols, 
                               thrust::raw_pointer_cast(d_ones.data()), 1, &beta, thrust::raw_pointer_cast(d_row_sums_4.data()), 1));

    printf("Timing for approach #4 = %f\n", timerGPU.GetCounter());

    for(int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_row_sums_4[i] << "\n";
    }

    return 0;
}

计时结果（在 Kepler K20c 上测试）

Matrix size       #1     #1-v2     #2     #3     #4     #4 (no plan)
100  x 100        0.63   1.00     0.10    0.18   139.4  0.098
1000 x 1000       1.25   1.12     3.25    1.04   101.3  0.12
5000 x 5000       8.38   15.3     16.05   13.8   111.3  1.14

 100 x 5000       1.25   1.52     2.92    1.75   101.2  0.40    

5000 x 100        1.35   1.99     0.37    1.74   139.2  0.14

似乎方法#1 和#3 优于方法#2，除了在少量列的情况下。然而，最好的方法是方法#4，它比其他方法更方便，前提是创建计划所需的时间可以在计算过程中摊销。

score 4 · Accepted Answer

如果这是您需要对这些数据进行的操作的范围（对行求和），我不希望 GPU 能带来可观的收益。每个数据元素只有一个算术运算，为此您需要支付将该数据元素传输到 GPU 的成本。超过一定的问题规模（无论需要什么让机器保持忙碌）你都不会从更大的问题规模中获得额外的好处，因为算术强度是 O(n)。

所以这不是在 GPU 上解决的特别令人兴奋的问题。

但正如 talonmies 所指出的，你在制作它的方式上有一个合并问题，这将进一步减慢速度。我们来看一个小例子：

    C1  C2  C3  C4
R1  11  12  13  14
R2  21  22  23  24
R3  31  32  33  34
R4  41  42  43  44

以上是矩阵一小部分的简单图示示例。机器数据存储使得元素(11)、(12)、(13)和(14)存储在相邻的存储位置中。

对于合并访问，我们需要一种访问模式，以便从同一条指令中请求相邻的内存位置，并在整个 warp 中执行。

我们需要从warp的角度来考虑代码的执行，即 32 个线程在锁步中执行。你的代码在做什么？它在每个步骤/指令中检索（要求）哪些元素？我们来看看这行代码：

        sum+=m[rowIdx*ncol+k];

rowIdx当您创建该变量时，经线中的相邻线程具有相邻（即连续）值。那么当k= 0 时，当我们尝试检索值时，每个线程会要求哪个数据元素m[rowIdx*ncol+k]？

在块 0 中，线程 0 的 arowIdx为 0。线程 1 的 arowIdx为 1，依此类推。因此，在此指令中每个线程要求的值是：

Thread:   Memory Location:    Matrix Element:
     0      m[0]                   (11)
     1      m[ncol]                (21)
     2      m[2*ncol]              (31)
     3      m[3*ncol]              (41)

但这不是合并访问！元素 (11)、(21) 等在内存中不相邻。对于合并访问，我们希望 Matrix Element 行如下所示：

Thread:   Memory Location:    Matrix Element:
     0      m[?]                   (11)
     1      m[?]                   (12)
     2      m[?]                   (13)
     3      m[?]                   (14)

如果您然后向后工作以确定?应该是什么值，您将提出如下指令：

        sum+=m[k*ncol+rowIdx];

这将提供合并访问，但它不会给您正确的答案，因为我们现在对矩阵列求和而不是矩阵行。我们可以通过将数据存储重新组织为列优先顺序而不是行优先顺序来解决此问题。（你应该可以用谷歌搜索，对吧？）从概念上讲，这相当于转置你的矩阵m. 在我看来，这对您来说是否方便不在您的问题范围之内，也不是真正的 CUDA 问题。当您在主机上创建矩阵或将矩阵从主机传输到设备时，这对您来说可能是一件简单的事情。但总而言之，如果矩阵以行优先顺序存储，我不知道有一种方法可以对具有 100% 合并访问的矩阵行求和。（你可以诉诸一系列的行缩减，但这对我来说看起来很痛苦。）

当我们考虑在 GPU 上加速代码的方法时，考虑重新组织我们的数据存储以促进 GPU 的使用并不少见。这是一个例子。

而且，是的，我在这里概述的内容仍然在内核中保留了一个循环。

作为附加评论，我建议分别计时数据复制部分和内核（计算）部分。我无法从您的问题中判断您是仅计时内核还是计时整个（GPU）操作，包括数据副本。如果您单独对数据复制计时，您可能会发现只是数据复制时间超过了您的 CPU 时间。任何优化 CUDA 代码的努力都不会影响数据复制时间。在您花费大量时间之前，这可能是一个有用的数据点。

c - 使用 CUDA 减少矩阵行

编辑 1

编辑 2

3 回答 3

内核方法

推力法

cuBLAS 方法

Related

Reference