optimization - Optimizing vector element swaps with CUDA

Question

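Since I am new to CUDA, I need your help. I have a long vector, and for each group of 24 elements I need to do the following: in the first 12 elements, multiply the even-indexed elements by -1; in the second 12 elements, multiply the odd-indexed elements by -1; then perform the following swap: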

Diagram: I don't have enough points yet to post an image, so here is a link:

http://www.freeimagehosting.net/image.php?e4b88fb666.png

I have written the code below and was wondering if you could help me optimize it further to deal with divergence or bank conflicts.

//subvector is a multiple of 24, Mds and Nds are shared memory

__shared__ double Mds[subVector];

__shared__ double Nds[subVector];

int tx = threadIdx.x;
int tx_mod = tx ^ 0x0001;
int  basex = __umul24(blockDim.x, blockIdx.x);

 Mds[tx] = M.elements[basex + tx];
__syncthreads();

// flip the signs 
 if (tx < (tx/24)*24 + 12)
 {  
    //if < 12 and even
    if ((tx & 0x0001)==0)
    Mds[tx] = -Mds[tx];
 }
 else
 if (tx < (tx/24)*24 + 24)
 {
    //if >12 and < 24 and odd
    if ((tx & 0x0001)==1)
    Mds[tx] = -Mds[tx];
 }

 __syncthreads();

 if (tx < (tx/24)*24 + 6)
 {  
//for the first 6 elements .. swap with last six in the 24elements group (see graph)
    Nds[tx] = Mds[tx_mod + 18];
    Mds [tx_mod + 18] = Mds [tx];
    Mds[tx] = Nds[tx];
 }
 else
 if (tx < (tx/24)*24 + 12)
 {
    // for the second 6 elements .. swp with next adjacent group (see graph)
    Nds[tx] = Mds[tx_mod + 6];
    Mds [tx_mod + 6] = Mds [tx];
    Mds[tx] = Nds[tx];
 }   
__syncthreads();

Thanks in advance.

score 1 · Accepted Answer

Paul gave you a pretty good starting point in your previous question.

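A couple of things to watch for: you are dividing by 24, which is not a power of two, and that is expensive. Instead, try to use the multidimensional nature of the thread block. For example, make the x dimension 24, which eliminates the need for the division.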

In general, try to choose thread block dimensions that reflect the dimensions of your data.
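As a rough sketch of the indexing suggestion above, assuming the block is launched as `dim3 block(24, groupsPerBlock)` with one thread per element: `threadIdx.x` is then already the position inside the 24-element group, so the test `tx < (tx/24)*24 + 12` from the question reduces to `lane < 12`. The names `HD`, `kGroup`, `flatIndex`, `inFirstHalf`, and `flipFirstHalfEven` are hypothetical and not part of the original code. The helpers are written so the block also compiles as plain C++.

```cpp
// Compiles as plain C++, or as CUDA when built with nvcc.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

constexpr int kGroup = 24;  // elements per group, as in the question

// Flat element index from (group number, position inside the group).
// Only a multiply and an add; no division or modulo by 24.
HD inline int flatIndex(int group, int lane) {
    return group * kGroup + lane;
}

// True for positions 0..11 of a group.
HD inline bool inFirstHalf(int lane) {
    return lane < kGroup / 2;
}

#ifdef __CUDACC__
// Assumed launch: dim3 block(kGroup, groupsPerBlock).
// Negates the even positions in the first half of every group.
__global__ void flipFirstHalfEven(double* v, int n) {
    int lane  = threadIdx.x;                            // 0..23 inside the group
    int group = blockIdx.x * blockDim.y + threadIdx.y;  // global group number
    int idx   = flatIndex(group, lane);
    if (idx < n && inFirstHalf(lane) && (lane & 1) == 0)
        v[idx] = -v[idx];
}
#endif
```

Note that a block of 24 x k threads is still split into warps of 32 threads, so a single warp can span two groups; this sketch only removes the division, not every source of divergence.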

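Simplify the sign flipping: for example, where you do not want to flip the sign, you can still multiply by the identity, 1. Figure out how to map even/odd to 1 and -1 using only arithmetic: for example, sign = (even*2+1) - 2, where even is either 1 or 0.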
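One possible way to apply that arithmetic to the rule in the question (negate even positions in the first 12 elements, odd positions in the second 12) is sketched below. It assumes 0-based positions, matching the `tx & 0x0001` tests in the question, and a block launched with an x dimension of 24. The names `HD`, `signForLane`, and `flipSigns` are hypothetical. The helper compiles as plain C++ as well as CUDA.

```cpp
// Compiles as plain C++, or as CUDA when built with nvcc.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Returns -1 for positions whose sign should flip and +1 otherwise,
// using only arithmetic and bit operations (no branches).
HD inline int signForLane(int lane) {
    int parity     = lane & 1;             // 1 for odd positions, 0 for even
    int secondHalf = (lane >= 12);         // 1 for positions 12..23, else 0
    int keep       = parity ^ secondHalf;  // 1 where the value is left alone
    return (keep * 2 + 1) - 2;             // maps 1 -> +1 and 0 -> -1
}

#ifdef __CUDACC__
// Assumed launch: dim3 block(24, groupsPerBlock), one thread per element.
__global__ void flipSigns(double* v, int n) {
    int lane = threadIdx.x;  // position inside the 24-element group
    int idx  = (blockIdx.x * blockDim.y + threadIdx.y) * 24 + lane;
    if (idx < n)
        v[idx] *= signForLane(lane);  // every thread multiplies; no sign branch
}
#endif
```

Every thread performs the same multiply, so the nested `if` tests on parity in the original sign-flip step are no longer needed.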
