I have this code:
struct __declspec(align(32)) Circle
{
float x, y;
float prevX, prevY;
float speedX, speedY;
float mass;
float radius;
void init(const int _x, const int _y, const float _speedX = 0.0f, const float _speedY = 0.0f,
const float _radius = CIRCLE_RADIUS_DEFAULT,
const float _mass = CIRCLE_MASS_DEFAULT);
};
And the second fragment:
/*smem[threadIdx.x] = *(((float*)cOut) + threadIdx.x);
smem[threadIdx.x + blockDim.x] = *(((float*)cOut) + threadIdx.x + blockDim.x);
smem[threadIdx.x + blockDim.x * 2] = *(((float*)cOut) + threadIdx.x + blockDim.x * 2);
smem[threadIdx.x + blockDim.x * 3] = *(((float*)cOut) + threadIdx.x + blockDim.x * 3);
smem[threadIdx.x + blockDim.x * 4] = *(((float*)cOut) + threadIdx.x + blockDim.x * 4);
smem[threadIdx.x + blockDim.x * 5] = *(((float*)cOut) + threadIdx.x + blockDim.x * 5);
smem[threadIdx.x + blockDim.x * 6] = *(((float*)cOut) + threadIdx.x + blockDim.x * 6);
smem[threadIdx.x + blockDim.x * 7] = *(((float*)cOut) + threadIdx.x + blockDim.x * 7);*/
__syncthreads();
/*float x, y;
float prevX, prevY;
float speedX, speedY;
float mass;
float radius;*/
/*c.x = smem[threadIdx.x];
c.y = smem[threadIdx.x + blockDim.x]; //should be [threadIdx.x * 8 + 0]
c.prevX = smem[threadIdx.x + blockDim.x * 2]; //[threadIdx.x * 8 + 1], etc.
c.prevY = smem[threadIdx.x + blockDim.x * 3];
c.speedX = smem[threadIdx.x + blockDim.x * 4];
c.speedY = smem[threadIdx.x + blockDim.x * 5];
c.mass = smem[threadIdx.x + blockDim.x * 6];
c.radius = smem[threadIdx.x + blockDim.x * 7];*/
c = cOut[j];
//c = *((Circle*)(smem + threadIdx.x * SMEM));
There are 2 gmem (I mean global memory) accesses: 1) reading a Circle and detecting collisions with it, and 2) writing the Circle back after changing its speed and position. I also have a constant array of Circles, circlesConst, set up with cudaMemcpyToSymbol(). It is used to check for intersections with the primary circle C that was read from gmem (it lives in registers).
As far as I can tell, I am using constant memory well and it gives me all the performance it can :') (am I wrong?)
When I read about coalesced access to gmem (is there coalesced access for other memory types? I couldn't find anything about that), I wanted to try it. As you can see, the Circle structure has 8 variables of type float = 32 bits. I tried it (it is commented out in the code), but, first, I got wrong answers (because I must be reading from smem incorrectly, as described below), and second, performance dropped by 33%. Why? I think it does not depend on the wrong field correspondence.
The second question: as I wrote in the comments near the read from smem into C, I would have to read it another way, but if I do that there will be many bank conflicts, so I would get even less performance... So, how can I load the circles coalesced and without bank conflicts, and then write them back?
P.S. Does a structure larger than 4 floats stay in registers?
UPDATE: the latest version is:
#define CF (9) //9 because the struct has 8 floats; the extra 1 float is padding
i = blockIdx.x * blockDim.x;
smem[threadIdx.x + blockDim.x * 0 + blockDim.x * 0 / (CF - 1) + threadIdx.x / (CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 0);
smem[threadIdx.x + blockDim.x * 1 + blockDim.x * 1 / (CF - 1) + threadIdx.x / (CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 1);
smem[threadIdx.x + blockDim.x * 2 + blockDim.x * 2 / (CF - 1) + threadIdx.x / (CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 2);
smem[threadIdx.x + blockDim.x * 3 + blockDim.x * 3 / (CF - 1) + threadIdx.x / (CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 3);
smem[threadIdx.x + blockDim.x * 4 + blockDim.x * 4 / (CF - 1) + threadIdx.x / (CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 4);
smem[threadIdx.x + blockDim.x * 5 + blockDim.x * 5 / (CF - 1) + threadIdx.x / (CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 5);
smem[threadIdx.x + blockDim.x * 6 + blockDim.x * 6 / (CF - 1) + threadIdx.x / (CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 6);
smem[threadIdx.x + blockDim.x * 7 + blockDim.x * 7 / (CF - 1) + threadIdx.x / (CF - 1)] = *(((float*)(cOut + i)) + threadIdx.x + blockDim.x * 7);
c.x = smem[threadIdx.x * CF + 0];
c.y = smem[threadIdx.x * CF + 1];
c.prevX = smem[threadIdx.x * CF + 2];
c.prevY = smem[threadIdx.x * CF + 3];
c.speedX = smem[threadIdx.x * CF + 4];
c.speedY = smem[threadIdx.x * CF + 5];
c.mass = smem[threadIdx.x * CF + 6];
c.radius = smem[threadIdx.x * CF + 7];
Is it correct to coalesce gmem accesses through smem this way? I mean, I am worried about `blockDim.x * 1 / (CF - 1) + threadIdx.x / (CF - 1)`. I think I am not getting any speedup because it does not let gmem coalesce reads across more than one circle, but I do not understand how to make it coalesce two circles...