cuda - Bank-conflict-free access in shared memory

Question

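I have to use a shared memory array of 64 elements, which is twice the number of banks (32). So there are twice as many memory accesses as there are threads in a warp. How should I address the elements so that the accesses are bank-conflict-free?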

score 2 · Accepted Answer

For 32-bit memory accesses, you can use the default access pattern:

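__shared__ int shared[32];
int data = shared[base + stride * tid];  // tid: thread index, e.g. threadIdx.x; keep the index within the array

where stride is odd. With a power-of-two number of banks, an odd stride sends every thread of the warp to a different bank.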
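To see why the stride must be odd, here is a small host-side C++ model of one warp-wide 32-bit load. It is a sketch under simplifying assumptions: a word's bank is taken to be `(word index) mod num_banks`, and threads reading the very same word count once (broadcast). The function name `strided_conflict_degree` is hypothetical, not a CUDA API.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Hypothetical model: thread tid reads 32-bit word base + stride * tid.
// Returns the n-way conflict degree, i.e. the largest number of *distinct*
// words that land in one bank (1 = conflict-free). Identical words are
// counted once to model the broadcast mechanism.
int strided_conflict_degree(long base, long stride, int num_threads, int num_banks) {
    assert(base >= 0 && stride >= 0 && num_threads > 0 && num_banks > 0);
    std::vector<std::set<long>> words_per_bank(num_banks);
    for (int tid = 0; tid < num_threads; ++tid) {
        long word = base + stride * tid;
        words_per_bank[word % num_banks].insert(word);
    }
    int degree = 0;
    for (const auto& words : words_per_bank)
        degree = std::max(degree, static_cast<int>(words.size()));
    return degree;
}
```

With 32 threads and 32 banks, `strided_conflict_degree(0, 3, 32, 32)` gives 1, while stride 2 gives a 2-way conflict and stride 32 a 32-way conflict. For a full warp and a nonzero stride the degree works out to gcd(stride, 32), which is 1 exactly when the stride is odd.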

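For accesses wider than 32 bits, you can use the following trick: with a struct of three 32-bit members, each member is read with a stride of three words, which is odd, so the three separate reads are conflict-free: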

struct type 
{  
   int x, y, z;
};
__shared__ struct type shared[32];
struct type data = shared[base + tid];
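The struct trick can be checked with the same kind of host-side model. This sketch assumes `base` is 0 and that each 32-bit member is fetched by its own load, so member `m` of element `tid` lives at word `tid * words_per_struct + m`. Newer GPUs may service 64-bit loads natively, so the numbers describe this per-member model only. The function name `member_conflict_degree` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Hypothetical model: thread tid reads one 32-bit member of shared[tid], where
// each struct occupies words_per_struct consecutive words. The member read is
// a strided access with stride words_per_struct. Returns the n-way conflict
// degree (1 = conflict-free).
int member_conflict_degree(int words_per_struct, int member, int num_threads, int num_banks) {
    assert(words_per_struct > 0 && member >= 0 && member < words_per_struct);
    assert(num_threads > 0 && num_banks > 0);
    std::vector<std::set<long>> words_per_bank(num_banks);
    for (int tid = 0; tid < num_threads; ++tid) {
        long word = static_cast<long>(tid) * words_per_struct + member;
        words_per_bank[word % num_banks].insert(word);
    }
    int degree = 0;
    for (const auto& words : words_per_bank)
        degree = std::max(degree, static_cast<int>(words.size()));
    return degree;
}
```

A three-word struct gives 1 for every member on both 16 and 32 banks. A two-word (64-bit) struct read member by member gives a 2-way conflict on 32 banks, and a four-word struct a 4-way conflict, because those strides are even.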

score 0 · Accepted Answer

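Assuming you are on compute capability 1.x, your shared memory has 16 banks (conflicts are resolved per half-warp of 16 threads), and each thread has to access 2 elements in shared memory.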

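You want each thread's two elements to sit in the same bank, so if you lay the data out so that a thread's two elements are 16 words apart, you avoid bank conflicts: each of the two loads touches 16 distinct banks across the half-warp.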

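__shared__ int shared[32];
int data0 = shared[base + stride * tid];       // first load: one bank per thread across the half-warp
int data1 = shared[base + stride * tid + 16];  // second load: same banks, 16 words further on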
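The same layout answers the question directly for 32 banks: offset the second element by a full warp (32) instead of 16. The host-side sketch below compares that "split" layout with an interleaved one (`2*tid`, `2*tid + 1`). It assumes `base` is 0 and `stride` is 1, models a bank as `(word index) mod num_banks`, and counts identical words once; the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// n-way conflict degree of one warp-wide load where thread i reads words[i].
// Identical words are counted once (broadcast); 1 means conflict-free.
int load_degree(const std::vector<long>& words, int num_banks) {
    assert(num_banks > 0);
    std::vector<std::set<long>> words_per_bank(num_banks);
    for (long w : words) {
        assert(w >= 0);
        words_per_bank[w % num_banks].insert(w);
    }
    int degree = 0;
    for (const auto& s : words_per_bank)
        degree = std::max(degree, static_cast<int>(s.size()));
    return degree;
}

// Each of num_threads threads reads two elements of a 2 * num_threads array
// with two separate loads. Split layout: tid and tid + num_threads.
// Interleaved layout: 2 * tid and 2 * tid + 1. Returns the worse degree.
int two_element_degree(bool interleaved, int num_threads, int num_banks) {
    std::vector<long> first, second;
    for (int tid = 0; tid < num_threads; ++tid) {
        if (interleaved) {
            first.push_back(2L * tid);
            second.push_back(2L * tid + 1);
        } else {
            first.push_back(tid);
            second.push_back(static_cast<long>(tid) + num_threads);
        }
    }
    return std::max(load_degree(first, num_banks), load_degree(second, num_banks));
}
```

For the question's case (64 elements, 32 banks, a full warp), reading `shared[tid]` and then `shared[tid + 32]` is conflict-free in this model, whereas `shared[2*tid]` and `shared[2*tid + 1]` is a 2-way conflict on each load.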

I use this pattern to store complex floats; since I have a whole tile of complex floats, it looks like this:

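#define TILE_WIDTH 16

__shared__ float shared[TILE_WIDTH][2*TILE_WIDTH + 1];
// row is a placeholder for the tile row this thread works on, e.g. threadIdx.y
float real = shared[row][base + stride * tid];               // real parts: columns 0 .. TILE_WIDTH-1
float imag = shared[row][base + stride * tid + TILE_WIDTH];  // imaginary parts: next TILE_WIDTH columns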

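The +1 pads each row to an odd width, which avoids serialization (bank conflicts) when the tile is read in transposed, column-wise order.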
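The effect of the +1 padding can be checked with a host-side model of a column read, where thread `tid` reads `tile[tid][col]` from a row-major tile. This is a sketch assuming a bank is `(word index) mod num_banks`; the function name `column_read_degree` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Hypothetical model of a transposed read: thread tid reads tile[tid][col]
// from a row-major tile with row_width floats per row, i.e. word
// tid * row_width + col. Returns the n-way conflict degree (1 = conflict-free).
int column_read_degree(int row_width, int col, int num_threads, int num_banks) {
    assert(row_width > 0 && col >= 0 && col < row_width);
    assert(num_threads > 0 && num_banks > 0);
    std::vector<std::set<long>> words_per_bank(num_banks);
    for (int tid = 0; tid < num_threads; ++tid) {
        long word = static_cast<long>(tid) * row_width + col;
        words_per_bank[word % num_banks].insert(word);
    }
    int degree = 0;
    for (const auto& words : words_per_bank)
        degree = std::max(degree, static_cast<int>(words.size()));
    return degree;
}
```

With 16 banks, the padded width 33 gives 1 for both the real and the imaginary columns, while an unpadded width of 32 puts every row in the same bank, a 16-way conflict. On 32-bank hardware the analogous padded width would be 2 * 32 + 1 = 65.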
