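I'm converting a program from a regular C program to a CUDA program, and I'd like to implement a simple wrapper around malloc that uses just one big pool.
I have 5000 threads. My block size is 1024.
Here is the buffer struct I use to keep track of each thread's memory pool: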
typedef struct buffer_t
{
    unsigned long size;
    char* current_index;
    char pool[];
} buffer_t;
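You can imagine that I allocate the memory with:
cudaMalloc(&memptr, 262144*5000);
on the assumption that each thread builds a buffer on top of its own 262144 bytes.
Here are the functions I use to do the allocation: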
__device__ buffer_t* buffer_constructor(size_t size, void* memptr)
{
    buffer_t* buffer = (buffer_t*)memptr;
    buffer->size = size - sizeof(unsigned long) - sizeof(char*);
    buffer->current_index = buffer->pool;
    return buffer;
}
__device__ void* buffer_malloc(buffer_t* buffer, size_t size)
{
    if(size > buffer->size - (buffer->current_index - buffer->pool))
    {
        return NULL;
    }
    void* ptr = buffer->current_index;
    buffer->current_index += size;
    return ptr;
}
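For comparison, here is a minimal sketch of a variant of these two functions. It assumes the misaligned-address errors reported further down come from `buffer_malloc` handing out pointers that are not multiples of 8, which can happen after any request whose size is not a multiple of 8. The names `HD`, `POOL_ALIGN`, and `align_up` are hypothetical and are not part of the original code. The sketch also assumes that each thread's region starts on an 8-byte boundary and that nothing stored in the pool needs more than 8-byte alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expands to __host__ __device__ under nvcc and to nothing in a plain C compiler. */
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

/* Assumption: no object placed in the pool needs more than 8-byte alignment. */
#define POOL_ALIGN 8u

typedef struct buffer_t
{
    unsigned long size;   /* usable bytes in pool */
    char* current_index;  /* next free byte in pool */
    char pool[];
} buffer_t;

/* Round n up to the next multiple of POOL_ALIGN (which must be a power of two). */
HD static size_t align_up(size_t n)
{
    return (n + (POOL_ALIGN - 1)) & ~(size_t)(POOL_ALIGN - 1);
}

HD static buffer_t* buffer_constructor(size_t size, void* memptr)
{
    /* The region itself must be aligned, or every pointer derived from it is off. */
    assert(((uintptr_t)memptr % POOL_ALIGN) == 0);
    buffer_t* buffer = (buffer_t*)memptr;
    /* offsetof also accounts for any padding the compiler puts before pool. */
    buffer->size = size - offsetof(buffer_t, pool);
    buffer->current_index = buffer->pool;
    return buffer;
}

HD static void* buffer_malloc(buffer_t* buffer, size_t size)
{
    size_t used = (size_t)(buffer->current_index - buffer->pool);
    size_t avail = buffer->size - used;
    if (size > avail)
    {
        return NULL;
    }
    /* Pad the request so the next allocation also starts on an aligned address. */
    size = align_up(size);
    if (size > avail)
    {
        return NULL;
    }
    void* ptr = buffer->current_index;
    buffer->current_index += size;
    return ptr;
}
```

With this variant, a 5-byte request followed by an 8-byte request returns two pointers that are 8 bytes apart, both multiples of 8, at the cost of up to 7 padding bytes per allocation.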
Each thread calls:
buffer_t* buffer = buffer_constructor(size, memptr+(tid * size));
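The per-thread offset in that call can be sketched as a small helper. This is only an illustration under assumptions: `thread_slab`, `slab_size`, and `num_slabs` are hypothetical names, and the bounds check assumes the grid is rounded up to whole blocks (5 blocks of 1024 would start 5120 threads, 120 more than there are regions). The cast to `char*` is there because pointer arithmetic on a `void*` is not standard C or C++.

```c
#include <stddef.h>

/* Expands to __host__ __device__ under nvcc and to nothing in a plain C compiler. */
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

/* Returns the start of thread tid's region, or NULL when tid has no region. */
HD static void* thread_slab(void* base, size_t tid, size_t slab_size, size_t num_slabs)
{
    if (tid >= num_slabs)
    {
        return NULL;
    }
    /* Byte-wise arithmetic: region tid begins tid * slab_size bytes past base. */
    return (char*)base + tid * slab_size;
}
```

A thread would then pass the result to `buffer_constructor` only when it is not NULL, for example `thread_slab(memptr, tid, 262144, 5000)`.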
When I run the code, it simply returns from the kernel at some point. When I run it under the debugger, I get this error:
Program received signal CUDA_EXCEPTION_6, Warp Misaligned Address.
[Switching focus to CUDA kernel 0, grid 1, block (2,0,0), thread (768,0,0), device 0, sm 10, warp 24, lane 0]
0x0000000000b48428 in device_matrix_list_constructor (buffer=<optimized out>, num=<optimized out>)
at device_matrix_list.cu:8
8 return list;
When I run memcheck, I get several errors like this one, in several blocks:
Invalid __global__ write of size 8
========= at 0x00000258 in /home/crafton.b/cuda_nn/device_matrix_list.cu:7:device_matrix_list_constructor(buffer_t*, unsigned int)
========= by thread (897,0,0) in block (4,0,0)
========= Address 0x235202a0fc is misaligned
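The reported address can be checked by hand. A small sketch, assuming the size-8 write has to land on an address that is a multiple of 8 (`is_aligned` is a hypothetical helper, not something from the original code):

```c
#include <stdint.h>

/* Returns 1 when addr is a multiple of alignment, 0 otherwise (alignment must be nonzero). */
static int is_aligned(uint64_t addr, uint64_t alignment)
{
    return addr % alignment == 0;
}
```

The address 0x235202a0fc ends in 0xfc, which is 252 = 8 * 31 + 4, so `is_aligned(0x235202a0fcULL, 8)` is 0 while `is_aligned(0x235202a0fcULL, 4)` is 1. In other words, the address is 4-byte aligned but not 8-byte aligned.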
Any help is greatly appreciated. I've been struggling with this for a while.