我的目标是编写一个自定义归约内核,它返回每行的 argmax 以及最大值和子最大值之间的差异(第二大最大值)。我是 CUDA 的新手,我正在使用 cupy。作为第一步,我尝试编写自己的max(axis=1)
内核。有时它可以工作,但对于大型矩阵它会崩溃。
import cupy as cp
import numpy as np
maxval2d = cp.RawKernel(r'''
extern "C" __global__
#define THREADS_PER_BLOCK (32*32)
void my_maxval2d(unsigned int cols, int* src, int* dst) {
__shared__ int block_data[THREADS_PER_BLOCK];
unsigned int row = blockDim.y * blockIdx.y + threadIdx.y;
unsigned int col = blockDim.x * blockIdx.x + threadIdx.x;
unsigned int threadId = threadIdx.y * blockDim.x + threadIdx.x;
unsigned int i = row * cols + col;
block_data[threadId] = src[i];
__syncthreads();
// do reduction in shared mem
for(unsigned int stride = blockDim.x/2; stride > 0; stride >>= 1) {
if (threadIdx.x < stride) {
int& a = block_data[threadId];
const int& b = block_data[threadId + stride];
if(b > a) {
a = b;
}
}
__syncthreads();
}
// write result for this block to global memory
if (threadIdx.x == 0) {
unsigned int left_col = row * cols + blockIdx.x;
dst[left_col] = block_data[blockDim.x * threadIdx.y];
}
}
''', 'my_maxval2d')
cols = 32*32
rows = 32
cp.random.seed(1)
src = cp.random.random((rows, cols))
src = (src*900 + 100).astype(cp.int32) # make integers from 100-999
dst = cp.zeros((rows, cols))
dst = dst.astype(cp.int32)
print('baseline:', src.max(axis=1)[0])
threads = 32
remaining = cols
counter = 0
while remaining > 1:
block_dim = (remaining//threads, rows)
thread_dim = (threads, rows)
print(f'loop {counter}, remaining: {remaining}, block_dim: {block_dim}, thread_dim: {thread_dim}')
maxval2d(block_dim, thread_dim, (cols, src, dst))
remaining //= threads
src, dst = dst, src
counter += 1
print('custom:', dst[0,0])
内核的基本轮廓取自CUDA Webinar 幻灯片。我知道此代码对于非 32 次方矩阵可能有不正确的结果,但对于我的 (32, 1024) 矩阵,我希望得到以下结果:
baseline: 996
loop 0, remaining: 1024, block_dim: (32, 32), thread_dim: (32, 32)
loop 1, remaining: 32, block_dim: (1, 32), thread_dim: (32, 32)
custom: 996
事实上,当我设置cols = 32
and时print(dst[0,0])
,我得到:
baseline: 994
loop 0, remaining: 32, block_dim: (1, 32), thread_dim: (32, 32)
custom: 994
但是使用 (32, 1024) 矩阵我得到:
---------------------------------------------------------------------------
CUDARuntimeError Traceback (most recent call last)
<ipython-input-17-858a0ab67cd5> in <module>()
58 src, dst = dst, src
59 counter += 1
---> 60 print('custom:', src[0,0])
cupy/core/core.pyx in cupy.core.core.ndarray.__str__()
cupy/core/core.pyx in cupy.core.core.ndarray.get()
cupy/cuda/memory.pyx in cupy.cuda.memory.MemoryPointer.copy_to_host()
cupy/cuda/runtime.pyx in cupy.cuda.runtime.memcpy()
cupy/cuda/runtime.pyx in cupy.cuda.runtime.check_status()
CUDARuntimeError: cudaErrorIllegalAddress: an illegal memory access was encountered
我的直觉说,在内核的某个地方,它超出了界限。但我不明白那可能在哪里。如何修复此代码以获得预期结果?