halide - Halide CUDA GPU SGEMM implementation

Question

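I'm trying to build a Halide-based image processing algorithm that needs SGEMM functionality at one of its stages.

I found that Halide has two matrix multiplication implementations:

Linear algebra algorithms (apps/linear_algebra folder)
CUDA matrix multiplication app (apps/cuda_mat_mul folder)

For matrices of size 1024x1024:

The first one (linear_algebra) works well on the CPU (Intel i7) and on a Fermi GPU (GF 540M): CPU time is close to OpenBLAS, and Fermi GPU time is close to cuBLAS (about 18 ms). However, this implementation is about 10x slower than cuBLAS on a Maxwell GPU (TitanX): 5 ms vs 0.4 ms. The second implementation (cuda_mat_mul) is about 3x slower than cuBLAS on Fermi (roughly 57 ms vs 18 ms) and about 2x slower than cuBLAS on the Maxwell GPU (1 ms vs 0.4 ms).

As far as I can see, Halide can generate near-optimal code for the Fermi GPU but cannot run fast on Maxwell. I understand that SGEMM is just a large number of fused multiply-adds with the right schedule, but I can't find a schedule that makes it run fast on Maxwell.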
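To make the "many fused multiply-adds" point concrete, here is a minimal plain C++ reference SGEMM (not Halide code, and not taken from either app). It assumes square, row-major matrices stored in `std::vector<float>`; the name `sgemm_reference` is hypothetical. A sketch like this is mainly useful for checking the output of a scheduled version on small inputs.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Naive row-major SGEMM: C = A * B for square n x n matrices.
// Each innermost iteration is one fused multiply-add, so the whole
// product costs n*n*n FMAs regardless of how the loops are scheduled.
std::vector<float> sgemm_reference(const std::vector<float>& a,
                                   const std::vector<float>& b,
                                   std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const float aik = a[i * n + k];
            for (std::size_t j = 0; j < n; ++j) {
                // c[i][j] += a[i][k] * b[k][j]
                c[i * n + j] = std::fma(aik, b[k * n + j], c[i * n + j]);
            }
        }
    }
    return c;
}
```

The arithmetic is fixed; a schedule only changes the order and placement of these operations, which is why the same algorithm can perform so differently across GPU generations.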

The fastest Halide code I could come up with is the one in the cuda_mat_mul folder; its schedule is:

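    // Algorithm: prod(x, y) = sum over r of A(x, r) * B(r, y)
    Func prod("prod");
    RDom r(0, size);
    prod(x, y) += A(x, r) * B(r, y);

    // Output schedule: one GPU block per 256x8 tile, 32 threads per block (xii),
    // each thread producing 64 outputs (8 xio steps x 8 yi rows, fully unrolled).
    Var xi, yi, xio, xii, yii, xo;
    Func out = prod.in();
    out.bound(x, 0, size)
        .bound(y, 0, size)
        .tile(x, y, xi, yi, 8*32, 8)
        .split(xi, xio, xii, 32)
        .reorder(xio, yi, xii, x, y)
        .unroll(xio)
        .unroll(yi)
        .gpu_blocks(x, y).gpu_threads(xii);
    // Compute prod per GPU thread (at xii), with its loops unrolled.
    prod.compute_at(out, xii)
        .unroll(x)
        .unroll(y)
        .update()
        .unroll(r.x, 2)
        .reorder(y, x, r.x)
        .unroll(x)
        .unroll(y);
    // Load B through a wrapper Func, vectorized along its first dimension.
    B.in()
        .compute_at(prod, y)
        .vectorize(B.in().args()[0]);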
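The launch geometry implied by this schedule can be worked out from the tile and split factors alone. The sketch below does that arithmetic in plain C++; the struct and function names are hypothetical, and it assumes `size` is a multiple of the 256x8 tile (true for 1024 and 2048).

```cpp
#include <cassert>

// Launch geometry implied by tile(x, y, xi, yi, 8*32, 8) + split(xi, xio, xii, 32).
struct LaunchGeometry {
    int blocks_x;            // extent of outer x (gpu_blocks)
    int blocks_y;            // extent of outer y (gpu_blocks)
    int threads_per_block;   // extent of xii (gpu_threads)
    int outputs_per_thread;  // unrolled xio extent times yi extent
};

LaunchGeometry geometry_for(int size) {
    const int tile_x = 8 * 32;  // tile width from the schedule
    const int tile_y = 8;       // tile height from the schedule
    const int split = 32;       // inner split factor (extent of xii)
    assert(size > 0 && size % tile_x == 0 && size % tile_y == 0);
    return {size / tile_x, size / tile_y, split, (tile_x / split) * tile_y};
}
```

For `size = 1024` this gives 4 x 128 = 512 blocks of 32 threads, each thread producing 64 outputs. Note that 32 threads is a single warp per block on NVIDIA hardware, which may be worth keeping in mind when comparing against cuBLAS.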

I also tried larger matrices (2048x2048), and the picture looks similar:

cuBLAS time: 0.003174
Halide linalg SGEMM time: 0.042568
Halide cuda_mat_mul time: 0.006792
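These timings are easier to compare as ratios and as achieved throughput. The helpers below are a small sketch with hypothetical names; the throughput figure additionally assumes the times are in seconds, which the output above does not state.

```cpp
#include <cassert>
#include <cstddef>

// How many times slower `t` is than the reference time `t_ref`.
double slowdown(double t, double t_ref) {
    assert(t_ref > 0.0);
    return t / t_ref;
}

// Achieved GFLOP/s for an n x n x n SGEMM (2*n^3 operations:
// one multiply and one add per fused multiply-add) that took `seconds`.
double sgemm_gflops(std::size_t n, double seconds) {
    assert(seconds > 0.0);
    const double d = static_cast<double>(n);
    return 2.0 * d * d * d / seconds / 1e9;
}
```

With the 2048x2048 figures, `slowdown(0.042568, 0.003174)` is about 13.4 for the linalg version and `slowdown(0.006792, 0.003174)` is about 2.1 for cuda_mat_mul. If the times are seconds, `sgemm_gflops(2048, 0.003174)` puts cuBLAS at roughly 5.4 TFLOP/s.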

The benchmark code is taken from apps/cuda_mat_mul/runner.cpp, with the number of iterations changed from 10 to 100 to get more precise timings.
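For readers without the Halide tree at hand, averaging over many iterations can be sketched as follows. This is not the code from runner.cpp; `average_seconds` is a hypothetical helper, and it assumes that `work` blocks until the GPU has finished (for example by synchronizing or copying the result back). Otherwise only the kernel launch would be timed.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Average wall-clock seconds per call of `work` over `iterations` timed runs.
double average_seconds(const std::function<void()>& work, int iterations) {
    assert(iterations > 0);
    work();  // one untimed warm-up call to absorb one-time setup costs
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count() / iterations;
}
```

Raising `iterations` from 10 to 100 reduces the influence of timer resolution and run-to-run noise on the reported average.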

How can I change the schedule so that its performance on the TitanX gets close to cuBLAS?

Update: tested on Ubuntu 16.4, LLVM 3.8, Halide (latest from git), CUDA 8.
