python - Arrays in CUDA kernels using Python and numba-pro

Question

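I am currently writing code that can be heavily parallelized on a GPU. The structure of my code is basically this:

1. Create two arrays of length N; let's call them A and B. (CPU)
2. Perform NxN calculations, each of which eventually returns a scalar. These calculations depend only on A and B and can therefore be parallelized. (GPU)
3. Gather all these scalars in a list and take the smallest one. (CPU)
4. Modify A and B using this scalar. (CPU)
5. Go back to step 2 and repeat until a certain condition is met.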
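As a rough CPU-only sketch of the loop described in the steps above: the per-pair calculation `pair_score`, the update rule, and the stopping tolerance are all hypothetical placeholders, since the actual calculation is not specified here. Only the control flow (N x N scalars, minimum, update, repeat) is meant to match.

```python
import numpy as np

def pair_score(a, b):
    # Hypothetical stand-in for the real per-pair calculation (step 2).
    return (a - b) ** 2

def all_scores(A, B):
    # Step 2: one scalar per (i, j) pair. On a GPU, each pair would be
    # handled by its own thread; here it is a plain double loop.
    N = A.shape[0]
    S = np.empty((N, N), dtype=np.float32)
    for i in range(N):
        for j in range(N):
            S[i, j] = pair_score(A[i], B[j])
    return S

# Step 1: create A and B on the host.
A = np.array([0.0, 1.0, 2.0, 3.0], dtype=np.float32)
B = np.array([5.0, 6.0, 7.0, 8.0], dtype=np.float32)

for iteration in range(100):               # hard cap so the loop always ends
    smallest = all_scores(A, B).min()      # step 3: take the smallest scalar
    if smallest < 1e-3:                    # hypothetical stopping condition
        break
    A += 0.5 * np.sqrt(smallest)           # step 4: hypothetical update rule

print(iteration, smallest)
```

Only `A`, `B`, and the array of scalars live on the host in this layout; everything inside `pair_score` is what would move into the kernel.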

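Most examples are very illustrative, but they all seem to work like this: the main part of the code runs on the CPU, and only intermediate steps such as matrix multiplications run on the GPU. In particular, the host usually knows all the variables the kernel is going to use.

For me it is exactly the other way around: I want to run the main part of the code on the GPU and only a very small number of steps on the CPU itself. My host knows nothing about what happens inside my individual threads. It only manages the list of scalars and my arrays A and B.

My questions are therefore:

1. How do I properly define variables within a kernel? In particular, how do I define and initialize arrays/lists?
2. How do I write a device function that returns an array? (MatrixMultiVector below does not work.)
3. Why can't I use numpy and other libraries within CUDA kernels? What alternatives do I have?

An example of what I currently have looks like this: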

from __future__ import division
import numpy as np
from numbapro import *


# Device Functions
#----------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Works and can be called correctly from TestKernel Scalar
@cuda.jit('float32(float32, float32)', device=True)
def myfuncScalar(a, b):
    return a+b;


# Works and can be called correctly from TestKernel Array
@cuda.jit('float32[:](float32[:])', device=True)
def myfuncArray(A):
    for k in xrange(4):
        A[k] += 2*k;
    return A


# Takes Matrix A and Vector v, multiplies them and returns a vector of shape v. Does not even compile.
# Failed at nopython (nopython frontend), Only accept returning of array passed into the function as argument
# But v is passed to the function as argument...

@cuda.jit('float32[:](float32[:,:], float32[:])', device=True)
def MatrixMultiVector(A,v):
    tmp = cuda.local.array(shape=4, dtype=float32); # is that thing even empty? It could technically be anything, right?
    for i in xrange(A[0].size):
        for j in xrange(A[1].size):
            tmp[i] += A[i][j]*v[j];
    v = tmp;
    return v;



# Kernels
#----------------------------------------------------------------------------------------------------------------------------------------------------------------------

# TestKernel Scalar - Works
@cuda.jit(void(float32[:,:]))
def TestKernelScalar(InputArray):
    i = cuda.grid(1)
    for j in xrange(InputArray[1].size):
        InputArray[i,j] = myfuncScalar(5,7);


# TestKernel Array
@cuda.jit(void(float32[:,:]))
def TestKernelArray(InputArray):

    # Defining arrays this way seems super tedious, there has to be a better way.
    M = cuda.local.array(shape=4, dtype=float32);
    M[0] = 1; M[1] = 0; M[2] = 0; M[3] = 0;

    tmp = myfuncArray(M);
    #tmp = MatrixMultiVector(A,M); -> we still have to define a 4x4 matrix for that.

    i = cuda.grid(1)
    for j in xrange(InputArray[1].size):
        InputArray[i,j] += tmp[j];

#----------------------------------------------------------------------------------------------------------------------------------------------------------------------
# Main
#----------------------------------------------------------------------------------------------------------------------------------------------------------------------

N = 4;

C = np.zeros((N,N), dtype=np.float32);
TestKernelArray[1,N](C);

print(C)

score 1 · Accepted Answer

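The short answer is that you can't define dynamic lists or arrays in CUDA Python. You can have statically defined local or shared memory arrays (see cuda.local.array() and cuda.shared.array in the documentation), but those have thread or block scope and can't be reused after their associated thread or block is retired. That is about all that is supported. You can pass externally defined arrays to kernels, but their attributes are read-only.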
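As your myfuncArray shows, you can return an externally defined array. You can't return a dynamically defined array, because dynamically defined arrays (or any objects, for that matter) are not supported in kernels.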
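One common way around this restriction is to have the caller allocate the result array and let the function fill it in, rather than returning a new array. The sketch below shows that pattern in plain Python so it runs without a GPU. The name `matvec_into` is hypothetical, and `np.empty` merely stands in for a `cuda.local.array` declared in the calling kernel; the assumption is that the same function body would compile as a device function under `@cuda.jit(device=True)`.

```python
import numpy as np

def matvec_into(A, v, out):
    """Write the product A @ v into the caller-supplied array `out`.

    Hypothetical helper. Under Numba this body is assumed to work as a
    device function, i.e. decorated with @cuda.jit(device=True).
    """
    for i in range(A.shape[0]):            # rows: A.shape[0], not A[0].size
        acc = 0.0                          # start from a known value; the
                                           # scratch array is not assumed zeroed
        for j in range(A.shape[1]):
            acc += A[i, j] * v[j]
        out[i] = acc
    # Nothing is returned: the caller already holds a reference to `out`.

A = np.arange(16, dtype=np.float32).reshape(4, 4)
v = np.array([1, 0, 0, 0], dtype=np.float32)
out = np.empty(4, dtype=np.float32)        # stand-in for cuda.local.array(4, float32)
matvec_into(A, v, out)
print(out)
```

Accumulating into a scalar and writing each element once also avoids reading from the output array before it has been initialized, which is the concern raised in the question's comment on `tmp`.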
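You can read the CUDA Python specification for yourself, but the really short answer is that CUDA Python is a superset of Numba's No Python Mode, and while elementary scalar functions are available, there is no Python object model support. That excludes a lot of Python functionality, including objects and numpy.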
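In practice this usually means rewriting a numpy call as an explicit loop over elements, using only scalar functions from the `math` module. A minimal sketch, assuming you need a Euclidean norm (the function name `euclidean_norm` is made up for illustration; it is written as plain Python here so it can be checked on the CPU):

```python
import math
import numpy as np

def euclidean_norm(x):
    # Replacement for np.linalg.norm(x) using only indexing, arithmetic
    # and a scalar math function, the kind of code a kernel can contain.
    total = 0.0
    for k in range(x.shape[0]):
        total += x[k] * x[k]
    return math.sqrt(total)

x = np.array([3.0, 4.0], dtype=np.float32)
result = euclidean_norm(x)
print(result)
```

The array itself is still created with numpy on the host; only the code that runs inside the kernel has to avoid numpy functions.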
