python - Numba matrix vector multiplication

Question

I am trying to write a simple matrix vector multiplication using numbapro, shown below:

from numbapro import cuda
from numba import *
import numpy as np
import math
from timeit import default_timer as time

n = 100

@cuda.jit('void(float32[:,:], float32[:], float32[:])')
def cu_matrix_vector(A, b, c):
    y, x = cuda.grid(2)
    if y < n:
        c[y] = 0.0

    if x < n and y < n:
        for i in range(n):
            c[y] += A[y, i] * b[i]


A = np.array(np.random.random((n, n)), dtype=np.float32)
B = np.array(np.random.random((n, 1)), dtype=np.float32)
C = np.empty_like(B)

s = time()
dA = cuda.to_device(A)
dB = cuda.to_device(B)
dC = cuda.to_device(C)
cu_matrix_vector(dA, dB, dC)
dC.to_host()

e = time()
tcuda = e - s

But I get the following error:

numbapro.cudadrv.error.CudaDriverError: CUDA_ERROR_LAUNCH_FAILED Failed to copy memory D->H

I don't understand why the device-to-host copy is failing. Please help.

score 6 · Accepted Answer

There are multiple problems with your code.

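1. The B and C vectors are Nx1 2D matrices rather than 1D vectors, but the type signature of your kernel lists them as "float32[:]", i.e. 1D vectors. The kernel also indexes them with a single index, which results in runtime errors on the GPU due to misaligned access (cuda-memcheck is your friend here!).
2. Your kernel assumes a 2D grid but only uses one column of it, meaning many threads perform the same computation and overwrite each other.
3. No execution configuration is given, so NumbaPro launches the kernel with 1 block of 1 thread (nvprof is your friend here!).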
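To make the launch-configuration points above concrete, here is a minimal CPU-only sketch in plain Python (no GPU needed) of how a 1D grid of 1D blocks maps threads onto matrix rows. The helpers `blocks_for` and `simulate_launch` are hypothetical names used for illustration, not part of NumbaPro or Numba. The only thing assumed about CUDA is that the global 1D thread index is `blockIdx.x * blockDim.x + threadIdx.x`, which is what `cuda.grid(1)` returns.

```python
def blocks_for(rows, threads_per_block):
    """Ceiling division: the smallest block count that covers every row."""
    return (rows + threads_per_block - 1) // threads_per_block


def simulate_launch(A, b, threads_per_block):
    """Run a one-thread-per-row mat-vec sequentially on the CPU.

    Hypothetical helper: it mimics the index arithmetic of a 1D grid of
    1D blocks and nothing more.
    """
    m, n = len(A), len(b)
    c = [0.0] * m
    writes = [0] * m  # how many "threads" wrote each output element
    for block in range(blocks_for(m, threads_per_block)):
        for thread in range(threads_per_block):
            row = block * threads_per_block + thread  # global 1D index
            if row < m:  # guard: the last block may overshoot the matrix
                acc = 0.0
                for i in range(n):
                    acc += A[row][i] * b[i]
                c[row] = acc
                writes[row] += 1
    return c, writes


A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
b = [1.0, 1.0]
c, writes = simulate_launch(A, b, threads_per_block=2)
print(blocks_for(100000, 512))  # 196
print(c)                        # [3.0, 7.0, 11.0]
print(writes)                   # [1, 1, 1]
```

Each output element is written exactly once, which is the property the 2D-grid version in the question lacks. The bounds check matters because 2 blocks of 2 threads give 4 indices for only 3 rows.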

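Here is code that works. Note that it uses a 1D grid of 1D blocks and loops over the columns of the matrix, so it is optimized for the case where the number of rows in the matrix is large. A kernel optimized for a short and wide matrix would need a different approach (parallel reductions). That said, I would use CUBLAS sgemv instead (it is also exposed in NumbaPro).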
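For the short-and-wide case, the idea behind a parallel reduction can be sketched on the CPU. This is plain Python rather than a GPU kernel, and `tree_reduce_dot` is a hypothetical name chosen for illustration. Each pass of the `while` loop stands in for one synchronized step in which the active threads of a block would each add one pair of partial sums.

```python
def tree_reduce_dot(row, b):
    """Dot product via pairwise (tree) reduction, run sequentially.

    Hypothetical illustration: on a GPU, the iterations of the inner
    loop would be executed by different threads, with a barrier between
    passes.
    """
    partial = [x * y for x, y in zip(row, b)]  # one product per "thread"
    # Pad to a power of two so every pass halves the active range cleanly.
    size = 1
    while size < len(partial):
        size *= 2
    partial += [0.0] * (size - len(partial))

    stride = size // 2
    passes = 0
    while stride > 0:
        for t in range(stride):  # these additions are independent
            partial[t] += partial[t + stride]
        stride //= 2
        passes += 1
    return partial[0], passes


total, passes = tree_reduce_dot([1.0, 2.0, 3.0, 4.0, 5.0], [1.0] * 5)
print(total, passes)  # 15.0 3
```

Five products are summed in three passes instead of five sequential additions. In general the number of passes grows with the base-2 logarithm of the row length, which is why this layout suits matrices with few rows and many columns.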

from numbapro import cuda
from numba import *
import numpy as np
import math
from timeit import default_timer as time

m = 100000  # number of rows in A (length of c)
n = 100     # number of columns in A (length of b)

@cuda.jit('void(f4[:,:], f4[:], f4[:])')
def cu_matrix_vector(A, b, c):
    # One thread per row of A.
    row = cuda.grid(1)
    if row < m:
        acc = 0.0

        for i in range(n):
            acc += A[row, i] * b[i]

        c[row] = acc

A = np.array(np.random.random((m, n)), dtype=np.float32)
B = np.array(np.random.random(n), dtype=np.float32)  # 1D, one entry per column of A
C = np.empty(m, dtype=np.float32)                    # 1D, one entry per row of A

s = time()
dA = cuda.to_device(A)
dB = cuda.to_device(B)
dC = cuda.to_device(C)

# Launch enough 512-thread blocks to cover all m rows.
cu_matrix_vector[(m + 511) // 512, 512](dA, dB, dC)

# Copy the result back to the host.
dC.to_host()

print(C)

e = time()
tcuda = e - s
