performance - Improving the efficiency of Compact/Scatter in CUDA

Question

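Summary:

Any ideas on how to further improve a basic scatter operation in CUDA? In particular, can it be improved if one knows it will only ever be used to compact a larger array into a smaller one? Or why did the vectorized memory operations and shared memory approaches below not work? I feel like I may be missing something fundamental, and any help would be appreciated.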

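EDIT 03/09/15: I found the Parallel for All blog post "Optimized Filtering with Warp-Aggregated Atomics". I had assumed atomics would be intrinsically slower for this purpose, but I was wrong, especially since I don't think I care about maintaining element order in the array during my simulation. I'll have to think about it some more and then implement it to see what happens!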
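For readers who have not seen the technique, the idea can be sketched roughly as follows. This is a minimal sketch and not the blog's code: the predicate `keep`, the kernel `filter_warp_aggregated`, and the host helper `filter_on_device` are hypothetical names, and it assumes a 1D block size that is a multiple of 32 and CUDA 9 or later for the `_sync` intrinsics.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical predicate for illustration: keep strictly positive values.
__device__ bool keep(float x) { return x > 0.0f; }

// Each warp counts its kept elements with a ballot, lane 0 reserves that many
// output slots with a single atomicAdd, and every kept lane writes to its own
// slot. Output order across warps is not deterministic.
__global__ void filter_warp_aggregated(float* out, int* out_count,
                                       const float* in, int n) {
    const unsigned full = 0xffffffffu;
    int lane = threadIdx.x & 31;
    int stride = blockDim.x * gridDim.x;
    // Round the bound up to a multiple of 32 so all lanes of a warp run the
    // same number of iterations, which the full-mask intrinsics require.
    int n_padded = (n + 31) / 32 * 32;
    for (int id = blockIdx.x * blockDim.x + threadIdx.x; id < n_padded; id += stride) {
        float v = (id < n) ? in[id] : 0.0f;
        bool pred = (id < n) && keep(v);
        unsigned mask = __ballot_sync(full, pred);
        int base = 0;
        if (lane == 0 && mask != 0) base = atomicAdd(out_count, __popc(mask));
        base = __shfl_sync(full, base, 0);  // broadcast the warp's base offset
        if (pred) out[base + __popc(mask & ((1u << lane) - 1u))] = v;
    }
}

// Host helper (illustrative): returns the kept elements in unspecified order.
std::vector<float> filter_on_device(const std::vector<float>& h_in) {
    int n = static_cast<int>(h_in.size());
    assert(n > 0);
    float *d_in = nullptr, *d_out = nullptr;
    int* d_count = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(int));

    filter_warp_aggregated<<<64, 256>>>(d_out, d_count, d_in, n);

    int h_count = 0;
    cudaError_t err = cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    std::vector<float> h_out(h_count);
    cudaMemcpy(h_out.data(), d_out, h_count * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out); cudaFree(d_count);
    return h_out;
}
```

The trade-off is the one noted above: one atomic per warp instead of one per element, at the cost of losing a reproducible output order.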

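EDIT 01/04/16: I realized I never wrote up my results. Unfortunately, in that Parallel for All blog post they compared the global atomic method for compaction against the Thrust prefix-sum compact method, which is actually quite slow. CUB's DeviceSelect::If is much faster than Thrust's, as is the prefix-sum version I wrote using CUB's DeviceScan plus custom code. The warp-aggregated global atomic method is still about 5-10% faster, but nowhere near the 3-4x I had been hoping for based on the results in the blog. I am still using the prefix-sum method: although maintaining element order is not necessary, I prefer the consistency of the prefix-sum results, and the advantage from the atomics is not very big. I am still trying various methods to improve the compaction.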
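The order-preserving library route mentioned above looks roughly like this with CUB. A minimal sketch: `KeepPositive` and `compact_with_cub` are hypothetical names, and the predicate is an assumption standing in for the simulation's real filter.

```cuda
#include <cassert>
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical selection functor: keep strictly positive values.
struct KeepPositive {
    __host__ __device__ __forceinline__
    bool operator()(const float& x) const { return x > 0.0f; }
};

// Host helper (illustrative): compacts h_in on the device, preserving order.
std::vector<float> compact_with_cub(const std::vector<float>& h_in) {
    int n = static_cast<int>(h_in.size());
    assert(n > 0);
    float *d_in = nullptr, *d_out = nullptr;
    int* d_num_selected = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_num_selected, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // First call only queries the temporary storage size; second call runs.
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceSelect::If(d_temp, temp_bytes, d_in, d_out, d_num_selected, n, KeepPositive());
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceSelect::If(d_temp, temp_bytes, d_in, d_out, d_num_selected, n, KeepPositive());

    int h_num = 0;
    cudaError_t err = cudaMemcpy(&h_num, d_num_selected, sizeof(int), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    std::vector<float> h_out(h_num);
    cudaMemcpy(h_out.data(), d_out, h_num * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_temp); cudaFree(d_in); cudaFree(d_out); cudaFree(d_num_selected);
    return h_out;
}
```

In a simulation that compacts every 40-60 steps, the temporary storage would presumably be allocated once and reused rather than allocated per call as in this sketch.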

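Details:

I am writing a simulation in CUDA in which, every 40-60 time steps, I compact out the elements I am no longer interested in simulating. From profiling, the scatter operation takes up the most time when compacting, more than the filter kernel or the prefix sum. Right now I use a pretty basic scatter function: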

    __global__ void scatter_arrays(float * new_freq, const float * const freq, const int * const flag, const int * const scan_Index, const int freq_Index){
            int myID =  blockIdx.x*blockDim.x + threadIdx.x;
            for(int id = myID; id < freq_Index; id+= blockDim.x*gridDim.x){
                 if(flag[id]){
                    new_freq[scan_Index[id]] = freq[id];
                 }
             } 
    }

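freq_Index is the number of elements in the old array. The flag array is the result of the filter. scan_Index is the result of the prefix sum on the flag array.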
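To make the roles of the three arrays concrete, here is a rough sketch of the whole filter, scan, scatter pipeline around that kernel. The filter `flag_positive` and the host helper `compact_by_scan` are hypothetical, and the scan is assumed to be an exclusive sum, since the kernel uses `scan_Index[id]` directly as a zero-based output position.

```cuda
#include <cassert>
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical filter: flag strictly positive frequencies.
__global__ void flag_positive(int* flag, const float* freq, int n) {
    for (int id = blockIdx.x * blockDim.x + threadIdx.x; id < n; id += blockDim.x * gridDim.x)
        flag[id] = freq[id] > 0.0f ? 1 : 0;
}

// The scatter kernel from the question.
__global__ void scatter_arrays(float* new_freq, const float* const freq, const int* const flag,
                               const int* const scan_Index, const int freq_Index) {
    int myID = blockIdx.x * blockDim.x + threadIdx.x;
    for (int id = myID; id < freq_Index; id += blockDim.x * gridDim.x) {
        if (flag[id]) new_freq[scan_Index[id]] = freq[id];
    }
}

// Host helper (illustrative): filter -> exclusive scan -> scatter.
std::vector<float> compact_by_scan(const std::vector<float>& h_freq) {
    int n = static_cast<int>(h_freq.size());
    assert(n > 0);
    float *d_freq = nullptr, *d_new = nullptr;
    int *d_flag = nullptr, *d_scan = nullptr;
    cudaMalloc(&d_freq, n * sizeof(float));
    cudaMalloc(&d_new, n * sizeof(float));
    cudaMalloc(&d_flag, n * sizeof(int));
    cudaMalloc(&d_scan, n * sizeof(int));
    cudaMemcpy(d_freq, h_freq.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    flag_positive<<<64, 256>>>(d_flag, d_freq, n);

    // Exclusive prefix sum of the flags gives each kept element its output slot.
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_flag, d_scan, n);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_flag, d_scan, n);

    scatter_arrays<<<64, 256>>>(d_new, d_freq, d_flag, d_scan, n);

    // New length = last exclusive-scan value + last flag.
    int last_scan = 0, last_flag = 0;
    cudaMemcpy(&last_scan, d_scan + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
    cudaError_t err = cudaMemcpy(&last_flag, d_flag + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    std::vector<float> h_new(last_scan + last_flag);
    cudaMemcpy(h_new.data(), d_new, h_new.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_temp); cudaFree(d_freq); cudaFree(d_new); cudaFree(d_flag); cudaFree(d_scan);
    return h_new;
}
```

For the input `{1, -2, 3, -4, 5}` the flags would be `{1, 0, 1, 0, 1}` and the exclusive scan `{0, 1, 1, 2, 2}`, so the three kept values land in slots 0, 1, and 2.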

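My attempts to improve it were to first read the flagged frequencies into shared memory and then write from shared memory to global memory, the idea being that the writes to global memory would be more coalesced across warps (e.g. instead of thread 0 writing to position 0 and thread 128 writing to position 1, thread 0 would write to 0 and thread 1 would write to 1). I also tried vectorizing the reads and writes: instead of reading and writing floats/ints, I read/wrote float4/int4 from the global arrays where possible, so four numbers at a time. I thought this might speed up the scatter by transferring more memory with fewer memory operations. The "kitchen sink" code with both vectorized memory loads/stores and shared memory is below: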

    const int compact_threads = 256;
    __global__ void scatter_arrays2(float * new_freq, const float * const freq, const int * const flag, const int * const scan_Index, const int freq_Index){
        int gID =  blockIdx.x*blockDim.x + threadIdx.x; //global ID
        int tID = threadIdx.x; //thread ID within block
        __shared__ float row[4*compact_threads];
        __shared__ int start_index[1];
        __shared__ int end_index[1];
        float4 myResult;
        int st_index;
        int4 myFlag;
        int4 index;
        for(int id = gID; id < freq_Index/4; id+= blockDim.x*gridDim.x){
            if(tID == 0){
                index = reinterpret_cast<const int4*>(scan_Index)[id];
                myFlag = reinterpret_cast<const int4*>(flag)[id];
                start_index[0] = index.x;
                st_index = index.x;
                myResult = reinterpret_cast<const float4*>(freq)[id];
                if(myFlag.x){ row[0] = myResult.x; }
                if(myFlag.y){ row[index.y-st_index] = myResult.y; }
                if(myFlag.z){ row[index.z-st_index] = myResult.z; }
                if(myFlag.w){ row[index.w-st_index] = myResult.w; }
            }
            __syncthreads();
            if(tID > 0){
                myFlag = reinterpret_cast<const int4*>(flag)[id];
                st_index = start_index[0];
                index = reinterpret_cast<const int4*>(scan_Index)[id];
                myResult = reinterpret_cast<const float4*>(freq)[id];
                if(myFlag.x){ row[index.x-st_index] = myResult.x; }
                if(myFlag.y){ row[index.y-st_index] = myResult.y; }
                if(myFlag.z){ row[index.z-st_index] = myResult.z; }
                if(myFlag.w){ row[index.w-st_index] = myResult.w; }
                if(tID == blockDim.x -1 || gID == freq_Index/4 - 1){ end_index[0] = index.w + myFlag.w; }
            }
            __syncthreads();
            int count = end_index[0] - st_index;

            int rem = st_index & 0x3; //equivalent to modulo 4
            int offset = 0;
            if(rem){ offset = 4 - rem; }

            if(tID < offset && tID < count){
                new_freq[st_index+tID] = row[tID];
            }

            int tempID = 4*tID+offset;
            if((tempID+3) < count){
                reinterpret_cast<float4*>(new_freq)[tID] = make_float4(row[tempID],row[tempID+1],row[tempID+2],row[tempID+3]);
            }

            tempID = tID + offset + (count-offset)/4*4;
            if(tempID < count){ new_freq[st_index+tempID] = row[tempID]; }
        }
        int id = gID + freq_Index/4 * 4; 
        if(id < freq_Index){
            if(flag[id]){
                new_freq[scan_Index[id]] = freq[id];
            }
        }
    }

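Obviously it gets a bit more complicated. :) While the above kernel seems stable when there are hundreds of thousands of elements in the array, I've noticed a race condition when the array numbers in the tens of millions. I'm still trying to track the bug down.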

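But regardless, neither method (shared memory or vectorization), together or alone, improved performance. I was especially surprised by the lack of benefit from vectorizing the memory operations. It had helped in other functions I had written, though now I wonder whether it helped there because it increased instruction-level parallelism in the calculation steps of those functions rather than because it reduced the number of memory operations.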
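One way to test that suspicion in isolation might be a plain copy kernel that does nothing but vectorized loads and stores. The sketch below is an assumption-laden illustration, not code from my simulation: `copy_vec4` and `copy_on_device` are hypothetical names, and it relies on both pointers being 16-byte aligned, which holds for pointers returned by `cudaMalloc`. In compaction the write offset comes from the scan, so that alignment is not guaranteed and has to be handled separately.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Copies n floats using float4 loads/stores for the bulk and scalar
// loads/stores for the last n % 4 elements.
__global__ void copy_vec4(float* dst, const float* src, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    int n4 = n / 4;  // number of complete float4 groups
    const float4* src4 = reinterpret_cast<const float4*>(src);
    float4* dst4 = reinterpret_cast<float4*>(dst);
    for (int i = tid; i < n4; i += stride) dst4[i] = src4[i];
    // Scalar tail for the elements that do not fill a float4.
    for (int i = 4 * n4 + tid; i < n; i += stride) dst[i] = src[i];
}

// Host helper (illustrative): round-trips a vector through copy_vec4.
std::vector<float> copy_on_device(const std::vector<float>& h_src) {
    int n = static_cast<int>(h_src.size());
    assert(n > 0);
    float *d_src = nullptr, *d_dst = nullptr;
    cudaMalloc(&d_src, n * sizeof(float));
    cudaMalloc(&d_dst, n * sizeof(float));
    cudaMemcpy(d_src, h_src.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    copy_vec4<<<64, 256>>>(d_dst, d_src, n);

    std::vector<float> h_dst(n);
    cudaError_t err = cudaMemcpy(h_dst.data(), d_dst, n * sizeof(float), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(d_src); cudaFree(d_dst);
    return h_dst;
}
```

Timing this against a scalar version of the same copy would show whether float4 access alone buys anything on a given GPU when no arithmetic is involved.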

score 1 · Accepted Answer

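I found that the algorithm mentioned in this poster (a similar algorithm is also discussed in this paper) works quite well, especially for compacting large arrays. It uses less memory and is slightly faster than my previous method (5-10%). I made a few tweaks to the poster's algorithm: 1) eliminating the final warp shuffle reduction in phase 1, since one can simply sum the elements as they are counted; 2) letting the function work on arrays that are not just sized as a multiple of 1024, plus adding grid-strided loops; and 3) allowing each thread to load its registers simultaneously in phase 3 instead of one at a time. I also use CUB instead of Thrust for the inclusive sum, for faster scans. There may be more tweaks I can make, but for now this is good.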

    //kernel phase 1: one ballot mask per 32 elements, one count per 1024 elements
    //(the _sync intrinsics below replace __ballot/__shfl_up/__shfl, which CUDA 9 deprecated)
    int myID =  blockIdx.x*blockDim.x + threadIdx.x;
    //padded_length is nearest multiple of 1024 > true_length
    for(int id = myID; id < (padded_length >> 5); id+= blockDim.x*gridDim.x){
        int lnID = threadIdx.x % warp_size; //lane within the warp (warp_size is 32)
        int warpID = id >> 5;               //global warp index; each warp handles 1024 elements

        unsigned int mask;
        unsigned int cnt=0;

        for(int j = 0; j < 32; j++){
            int index = (warpID<<10)+(j<<5)+lnID;

            bool pred;
            if(index >= true_length) pred = false; //padding: valid indices are 0..true_length-1
            else pred = predicate(input[index]);
            mask = __ballot_sync(0xffffffff, pred);

            if(lnID == 0) {
                flag[(warpID<<5)+j] = mask;
                cnt += __popc(mask);
            }
        }

        if(lnID == 0) counter[warpID] = cnt; //store sum
    }

    //kernel phase 2 -> CUB Inclusive sum transforms counter array to scan_Index array

    //kernel phase 3
    int myID =  blockIdx.x*blockDim.x + threadIdx.x;

    for(int id = myID; id < (padded_length >> 5); id+= blockDim.x*gridDim.x){
        int lnID = threadIdx.x % warp_size;
        int warpID = id >> 5;

        unsigned int predmask;
        unsigned int cnt;

        //each lane loads one of the warp's 32 ballot masks
        predmask = flag[(warpID<<5)+lnID];
        cnt = __popc(predmask);

        //parallel prefix sum (inclusive) of the per-mask counts across the warp
    #pragma unroll
        for(int offset = 1; offset < 32; offset<<=1){
            unsigned int n = __shfl_up_sync(0xffffffff, cnt, offset);
            if(lnID >= offset) cnt += n;
        }

        //number of kept elements in all preceding warps
        unsigned int global_index = 0;
        if(warpID > 0) global_index = scan_Index[warpID - 1];

        for(int i = 0; i < 32; i++){
            unsigned int mask = __shfl_sync(0xffffffff, predmask, i); //broadcast from thread i
            unsigned int sub_group_index = 0;
            if(i > 0) sub_group_index = __shfl_sync(0xffffffff, cnt, i-1);
            if(mask & (1u << lnID)){
                compacted_array[global_index + sub_group_index + __popc(mask & ((1u << lnID) - 1))] = input[(warpID<<10)+(i<<5)+lnID];
            }
        }
    }
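Phase 2 is only described in a comment above, so here is a rough sketch of what that step could look like on the host. `inclusive_scan_counters` is a hypothetical helper name, and the per-call allocation is for illustration only. It assumes `counter` holds one `unsigned int` per warp, as written by phase 1.

```cuda
#include <cassert>
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <vector>

// Host helper (illustrative): inclusive prefix sum of the per-warp counters.
// Element w of the result is the number of kept elements in warps 0..w, so
// phase 3 can read scan_Index[warpID - 1] as the output base for warp warpID.
std::vector<unsigned int> inclusive_scan_counters(const std::vector<unsigned int>& h_counter) {
    int n = static_cast<int>(h_counter.size());
    assert(n > 0);
    unsigned int *d_counter = nullptr, *d_scan = nullptr;
    cudaMalloc(&d_counter, n * sizeof(unsigned int));
    cudaMalloc(&d_scan, n * sizeof(unsigned int));
    cudaMemcpy(d_counter, h_counter.data(), n * sizeof(unsigned int), cudaMemcpyHostToDevice);

    // First call only queries the temporary storage size; second call runs.
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, d_counter, d_scan, n);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, d_counter, d_scan, n);

    std::vector<unsigned int> h_scan(n);
    cudaError_t err = cudaMemcpy(h_scan.data(), d_scan, n * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(d_temp); cudaFree(d_counter); cudaFree(d_scan);
    return h_scan;
}
```

The last element of the scanned array is the total number of kept elements, which gives the length of the compacted array.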

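EDIT: There is a newer article by a subset of the poster's authors in which they examine a faster variation of compact than the one written above. However, their new version is not order-preserving, so it is not useful for me and I haven't implemented it to test it out. That said, if your project doesn't rely on object order, their newer compact version can probably speed up your algorithm.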
