
I am trying to implement element-wise multiplication of two numpy arrays by creating corresponding GPU arrays and performing the operation on them. However, the resulting execution time is much slower than the original numpy pointwise multiplication. I was hoping to get a good speedup by using the GPU. zz0 is a complex128, (64,256,16)-shaped numpy array and xx0 is a float64, (16,151)-shaped numpy array. Can someone help me figure out what I am doing wrong in the implementation:

import sys
import numpy as np
import matplotlib.pyplot as plt
import pdb
import time

import pycuda.driver as drv
import pycuda.autoinit
from pycuda.compiler import SourceModule
from pycuda.elementwise import ElementwiseKernel
import pycuda.gpuarray as gpuarray
import pycuda.cumath
import skcuda.linalg as linalg

linalg.init()

# Function for doing a point-wise multiplication using GPU
def calc_Hyp(zz,xx):
    zz_stretch = np.tile(zz, (1,1,1,xx.shape[3]))
    xx_stretch = np.tile(xx, (zz.shape[0],zz.shape[1],1,1))
    zzg = gpuarray.to_gpu(zz_stretch)
    xxg = gpuarray.to_gpu(xx_stretch)
    zz_Hypg = linalg.multiply(zzg,xxg)
    zz_Hyp = zz_Hypg.get()
    return zz_Hyp


zz0 = np.random.uniform(10.0/5000, 20000.0/5000, (64,256,16)).astype('complex128')
xx0 = np.random.uniform(10.0/5000, 20000.0/5000, (16,151)).astype('float64')

xx0_exp = np.exp(-1j*xx0)

t1 = time.time()

#Using GPU for the calculation
zz0_Hyp = calc_Hyp(zz0[:,:,:,None],xx0_exp[None,None,:,:])
#np.save('zz0_Hyp',zz0_Hyp)

t2 = time.time()
print('Time taken with GPU:{}'.format(t2-t1))

#Original calculation
zz0_Hyp_actual = zz0[:,:,:,None]*xx0_exp[None,None,:,:]
#np.save('zz0_Hyp_actual',zz0_Hyp_actual)

t3 = time.time()
print('Time taken without GPU:{}'.format(t3-t2))

1 Answer


The first problem is that your timing metrics are not accurate.

Linalg compiles cuda modules just in time, so you may be seeing code being compiled as it runs. I made some minor modifications to your code to shrink the size of the arrays being multiplied, but regardless, after two runs with no other improvements I saw huge gains in performance, for example:

Time taken with GPU:2.5476348400115967
Time taken without GPU:0.16627931594848633

versus

Time taken with GPU:0.8741757869720459
Time taken without GPU:0.15836167335510254
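
For example, one way to keep that one-time compilation cost out of the measurement is a throwaway warm-up call before the timed section (a minimal sketch based on the question's calc_Hyp; the warm-up call and the label text are mine):

# warm-up: triggers the JIT compilation and caching, result is discarded
_ = calc_Hyp(zz0[:, :, :, None], xx0_exp[None, None, :, :])

t1 = time.time()
zz0_Hyp = calc_Hyp(zz0[:, :, :, None], xx0_exp[None, None, :, :])
t2 = time.time()
print('Time taken with GPU (after warm-up):{}'.format(t2 - t1))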

However, this is still much slower than the CPU version. The next thing I did was time based on where the actual computation is happening. You aren't tiling in your numpy version, so don't time the tiling in your cuda version either:

REAL Time taken with GPU:0.6461708545684814

You are also copying to the GPU and including that in the calculation, but that in itself takes a non-trivial amount of time, so let's remove it:

t1 = time.time()
zz_Hypg = linalg.multiply(zzg,xxg)
t2 = time.time()
...
REAL Time taken with GPU:0.3689603805541992
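
Putting that together, here is a sketch of the measurement with the tiling and the host-to-device copies kept outside the timed section (names follow the question's calc_Hyp; the explicit synchronize is my addition, since kernel launches are asynchronous):

# tiling happens on the host and is not part of the GPU computation we want to time
zz_stretch = np.tile(zz0[:, :, :, None], (1, 1, 1, xx0_exp.shape[1]))
xx_stretch = np.tile(xx0_exp[None, None, :, :], (zz0.shape[0], zz0.shape[1], 1, 1))

# host-to-device copies, also excluded from the timed section
zzg = gpuarray.to_gpu(zz_stretch)
xxg = gpuarray.to_gpu(xx_stretch)

t1 = time.time()
zz_Hypg = linalg.multiply(zzg, xxg)
drv.Context.synchronize()   # make sure the kernel has actually finished before reading the clock
t2 = time.time()
print('REAL Time taken with GPU:{}'.format(t2 - t1))

zz_Hyp = zz_Hypg.get()      # device-to-host copy, also outside the timed section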

Wow, that made a huge difference. But we are still slower than the numpy version? Why?

Remember how I said that numpy doesn't tile? It doesn't copy memory at all for broadcasting (a quick check of this follows the list below). To get the real speed you have to:

  • not tile
  • broadcast the dimensions
  • implement it in the kernel.
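
A quick way to convince yourself that numpy's broadcasting really does not copy anything (this check is mine, not part of the original timing):

a = zz0[:, :, :, None]                               # shape (64, 256, 16, 1)
b = np.broadcast_to(a, (64, 256, 16, xx0.shape[1]))  # shape (64, 256, 16, 151), still a view
print(b.strides)               # the last stride is 0: the broadcast axis reuses the same memory
print(np.shares_memory(a, b))  # True, no copy was made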

Pycuda does provide utilities for kernel implementations, but its GPU arrays do not provide broadcasting. Essentially what you would have to do is this (disclaimer: I haven't tested this, there are probably bugs, it is just to show approximately what the kernel should look like):

#include <pycuda-complex.hpp>
//KERNEL CODE
constexpr unsigned work_tile_dim = 32;
//instruction level parallelism factor, how much extra work to do per thread, may be changed but affects the launch dimensions. thread group size should be (work_tile_dim, work_tile_dim/ilp_factor)
constexpr unsigned ilp_factor = 4;
//assuming c order: 
//    x axis contiguous out, 
//    y axis contiguous in zz, 
//    x axis contiguous in xx
// using restrict because we know that all pointers will refer to different parts of memory. 
__global__ 
void element_wise_multiplication(
    pycuda::complex<double>* __restrict__ array_zz, 
    pycuda::complex<double>* __restrict__ array_xx,
    pycuda::complex<double>* __restrict__ out_array,
    unsigned array_zz_w, /*size of w,z,y, dimensions used in zz*/
    unsigned array_zz_z,
    unsigned array_zz_xx_y,/*size of y,x, dimensions used in xx, but both have same y*/
    unsigned array_xx_x){


    // z dimensions in blocks often have restrictions on size that can be fairly small, and sometimes can cause performance issues on older cards, we are going to derive x,y,z,w index from just the x and y indices instead. 
    unsigned x_idx = blockIdx.x * (work_tile_dim) + threadIdx.x;
    // each thread handles ilp_factor consecutive rows, so a (work_tile_dim, work_tile_dim/ilp_factor) thread block covers the full tile
    unsigned y_idx = blockIdx.y * (work_tile_dim) + threadIdx.y * ilp_factor;
    //blockIdx.z stores both z and w and should not overshoot; z_idx and w_idx aren't used below,
    //they are shown only for the sake of how to recover these dimensions. 
    unsigned z_idx = blockIdx.z % array_zz_z;
    unsigned w_idx = blockIdx.z / array_zz_z;
    (void)z_idx; (void)w_idx; // silence unused-variable warnings
    //we already know this part of the indexing calculation. 
    unsigned out_idx_zw = blockIdx.z * (array_zz_xx_y * array_xx_x);
    // since our zz input array is actually 3D, this is a different calculation
    unsigned array_zz_zw = blockIdx.z * (array_zz_xx_y);
    //ensures if our launch dimensions don't exactly match our input size, we don't 
    //accidentally access out of bound memory, while branching can be bad, this isn't 
    // because 99.999% of the time no branch will occur and our instruction pointer 
    //will be the same per warp, meaning virtually zero cost. 
    if(x_idx < array_xx_x){
        //moving over y axis to coalesce memory accesses in the x dimension per warp. 
        for(int i = 0; i < ilp_factor; ++i){
            //need to also check y, these checks are virtually cost-less 
            // because memory access dominates time in such simple calculations, 
            // and arithmetic will be hidden by overlapping execution 
            if((y_idx+i) < array_zz_xx_y){
                //splitting up calculation for simplicity sake
                unsigned out_array_idx = out_idx_zw + (y_idx+i)*array_xx_x + x_idx;
                unsigned array_zz_idx  = array_zz_zw + (y_idx+i);
                unsigned array_xx_idx  = ((y_idx+i) * array_xx_x) + x_idx;
                //actual final output. 
                out_array[out_array_idx] = array_zz[array_zz_idx] * array_xx[array_xx_idx];
            }
        }
    }
}

You would then have to set the launch dimensions like so:

work_tile_dim = 32   # must match the constexpr values compiled into the kernel
ilp_factor = 4
thread_dim = (work_tile_dim, work_tile_dim // ilp_factor, 1)  # (32, 8, 1), the CUDA block
y_dim = xx0.shape[0]
x_dim = xx0.shape[1]
wz_dim = zz0.shape[0] * zz0.shape[1]
# round the grid up so partial tiles at the edges are still covered (the kernel bounds-checks)
block_dim = ((x_dim + work_tile_dim - 1) // work_tile_dim,
             (y_dim + work_tile_dim - 1) // work_tile_dim,
             wz_dim)
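
For completeness, here is a rough sketch of how the kernel above could be compiled and launched from Python with pycuda's SourceModule. The same disclaimer applies as for the kernel itself: untested, and kernel_src is assumed to be a Python string holding the CUDA source above.

mod = SourceModule(kernel_src)   # kernel_src: the CUDA source above, as a string
multiply_kernel = mod.get_function("element_wise_multiplication")

zzg = gpuarray.to_gpu(zz0)       # (64, 256, 16) complex128, NOT tiled
xxg = gpuarray.to_gpu(xx0_exp)   # (16, 151) complex128
outg = gpuarray.empty((zz0.shape[0], zz0.shape[1], zz0.shape[2], xx0_exp.shape[1]),
                      np.complex128)

multiply_kernel(zzg.gpudata, xxg.gpudata, outg.gpudata,
                np.uint32(zz0.shape[0]), np.uint32(zz0.shape[1]),
                np.uint32(zz0.shape[2]), np.uint32(xx0_exp.shape[1]),
                block=thread_dim, grid=block_dim)
zz0_Hyp_kernel = outg.get()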

There are also some further optimizations you could take advantage of:

  • Stage the global memory accesses for a work tile in shared memory inside the kernel. This ensures that the accesses to zz0's "y" dimension, which is really the x dimension once it is placed in shared memory, are coalesced, improving performance; the data is then read back out of shared memory (where coalescing doesn't matter, but bank conflicts do). See here for how to deal with that kind of bank conflict.

  • Instead of computing Euler's formula on the host and expanding a double into a complex double, expand it inside the kernel itself: use sincos(-x, &out_sin, &out_cos) to achieve the same result with far less memory bandwidth (see here).
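
For reference, the identity the sincos suggestion relies on, checked on the host (this is just the math; the actual change would be made inside the kernel, with xx passed as plain float64 phases):

x = np.random.uniform(0.0, 2 * np.pi, 8)
# exp(-1j*x) == cos(x) - 1j*sin(x), which is what sincos(-x, &s, &c) reconstructs on the device
print(np.allclose(np.exp(-1j * x), np.cos(x) - 1j * np.sin(x)))   # True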

Note, however, that even doing all of this will probably not give you the performance you want (though it will likely still be faster), unless you are on a high-end GPU with full double precision units, which most GPUs do not have (double precision is emulated most of the time). Double precision floating point units take up a lot of silicon, and since GPUs are built for graphics, they don't have much use for it. If you want higher precision than single-precision float, but want to take advantage of the float hardware without the 1/8 to 1/32 double precision throughput hit, you can use the technique used in this answer to achieve that on the GPU, getting you closer to 1/2 to 1/3 of the throughput.

Answered 2019-06-28T15:18:09.773