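I am converting a regular C program to a CUDA program, and I want to implement a simple wrapper for malloc that just uses one large pool.

I have 5000 threads. My block size is 1024.

This is the buffer struct I use to keep track of each thread's memory pool.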

typedef struct buffer_t
{
    unsigned long size;
    char* current_index;
    char pool[];
} buffer_t;

You can imagine the allocation being done with:

cudaMalloc(&memptr, 262144*5000);

on the assumption that each thread builds its buffer on its own 262144 bytes.

Here are the functions I use to do the allocation:

__device__ buffer_t* buffer_constructor(size_t size, void* memptr)
{
    buffer_t* buffer = (buffer_t*)memptr;
    buffer->size = size - sizeof(unsigned long) - sizeof(char*);
    buffer->current_index = buffer->pool;
    return buffer;
}
__device__ void* buffer_malloc(buffer_t* buffer, size_t size)
{
    if(size > buffer->size - (buffer->current_index - buffer->pool))
    {
        return NULL;
    }

    void* ptr = buffer->current_index;
    buffer->current_index += size;
    return ptr;
}
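
One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: buffer_malloc advances current_index by exactly size bytes, so a request whose size is not a multiple of 8 leaves the next returned pointer off an 8-byte boundary. A minimal sketch of a variant that rounds every request up is shown here. BUFFER_ALIGN, align_up and buffer_malloc_aligned are hypothetical names, not part of the original code, and the sketch assumes pool itself starts on an 8-byte boundary (which holds when each thread's slice starts at a multiple of 8 and the header is 16 bytes, as on a typical 64-bit build).

```c
#include <stddef.h>
#include <stdint.h>

#ifndef __CUDACC__
#define __device__   /* lets the sketch also compile as plain C on the host */
#endif

typedef struct buffer_t
{
    unsigned long size;
    char* current_index;
    char pool[];
} buffer_t;

/* Hypothetical alignment for every allocation; must be a power of two. */
#define BUFFER_ALIGN 8

/* Round n up to the next multiple of BUFFER_ALIGN. */
__device__ size_t align_up(size_t n)
{
    return (n + (BUFFER_ALIGN - 1)) & ~(size_t)(BUFFER_ALIGN - 1);
}

/* Same bump allocation as buffer_malloc, but the cursor only ever moves
   in multiples of BUFFER_ALIGN, so returned pointers stay aligned as long
   as pool itself is aligned. */
__device__ void* buffer_malloc_aligned(buffer_t* buffer, size_t size)
{
    size_t used = (size_t)(buffer->current_index - buffer->pool);
    size_t padded = align_up(size);

    /* padded < size catches wrap-around for sizes near SIZE_MAX. */
    if (padded < size || padded > buffer->size - used)
    {
        return NULL;
    }

    void* ptr = buffer->current_index;
    buffer->current_index += padded;
    return ptr;
}
```

With this variant, a 5-byte request followed by an 8-byte request returns pointers 8 bytes apart instead of 5. The cost is up to 7 bytes of padding per allocation.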

Each thread calls:

buffer_t* buffer = buffer_constructor(size, memptr+(tid * size));

When I run the code, it just returns from the kernel at some point. When I run it under the debugger, I get this error:

Program received signal CUDA_EXCEPTION_6, Warp Misaligned Address.
[Switching focus to CUDA kernel 0, grid 1, block (2,0,0), thread (768,0,0), device 0, sm 10, warp 24, lane 0]
0x0000000000b48428 in device_matrix_list_constructor (buffer=<optimized out>, num=<optimized out>)
    at device_matrix_list.cu:8
8               return list;

When I run memcheck, I get several errors like this one across several blocks:

Invalid __global__ write of size 8
=========     at 0x00000258 in /home/crafton.b/cuda_nn/device_matrix_list.cu:7:device_matrix_list_constructor(buffer_t*, unsigned int)
=========     by thread (897,0,0) in block (4,0,0)
=========     Address 0x235202a0fc is misaligned
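
For reference, the arithmetic behind that report can be checked by hand: the reported address ends in 0xfc, and 0xfc & 7 is 4, so the address is a multiple of 4 but not of 8, while the write is 8 bytes wide. A small sketch of that check follows. is_aligned is a hypothetical helper, not something from the original code.

```c
#include <stdint.h>

#ifndef __CUDACC__
#define __device__   /* lets the sketch also compile as plain C on the host */
#endif

/* Returns 1 when addr is a multiple of n; n must be a power of two. */
__device__ int is_aligned(uint64_t addr, uint64_t n)
{
    return (addr & (n - 1)) == 0;
}
```

Called with the address from the memcheck output, is_aligned(0x235202a0fcULL, 8) is 0 and is_aligned(0x235202a0fcULL, 4) is 1.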

Any help is much appreciated. I have been struggling with this for a while.
