I am trying to implement a parallel merging algorithm in CUDA. The algorithm is designed to be executed in a single thread block. Its basic idea is to compute the global rank of each element in two input sequences: since both sequences are sorted, an element's global rank is equal to its index in its original sequence plus its rank in the other sequence, which is computed by binary search. I think the best strategy for implementing such an algorithm is to load the two sequences into shared memory to reduce global memory reads. However, when I compared two versions of the implementation, one using shared memory and one without, I saw no performance improvement. I wonder if I am doing something wrong.
Hardware: GeForce GTX 285, Linux x86_64. The time to merge two sequences of 1024 elements is about 0.068672 ms for both implementations.
// Merge two sorted int sequences A (length numA) and B (length numB) into
// C (length numA + numB), within a single thread block.
//
// Each output position is computed directly: an element's final index in C
// is its index in its own sequence plus the number of elements of the OTHER
// sequence that precede it, found by binary search over shared memory.
// Ties are broken so A's elements land before equal B elements, which makes
// every computed output index unique (no write conflicts in C).
//
// Launch contract:
//   - exactly one block (<<<1, T, smem>>>), any thread count T;
//   - dynamic shared memory of (numA + numB) * sizeof(int) bytes.
//
// Safe for numA == 0 and/or numB == 0: the half-open binary searches never
// dereference an element of an empty sequence (the original edge-case
// branches read tempB[0]/tempB[numB-1] etc. unconditionally and would fault).
__global__ void localMerge(int * A, int numA, int * B, int numB, int * C) {
    extern __shared__ int temp[];   // staging: copy of A, then copy of B
    int *tempA = temp;
    int *tempB = temp + numA;
    const int tx = threadIdx.x;
    const int stride = blockDim.x;  // block-stride loops: works for any T

    // Stage both inputs in shared memory so the O(log n) search probes
    // below hit fast on-chip memory instead of global memory.
    for (int i = tx; i < numA; i += stride) tempA[i] = A[i];
    for (int i = tx; i < numB; i += stride) tempB[i] = B[i];
    __syncthreads();  // staging must be complete before any thread searches

    // Rank of A[i] = i + (count of B elements strictly less than A[i]).
    // Strict '<' places A before equal B elements (lower-bound search).
    // The half-open range [0, numB] subsumes both boundary cases
    // (count == 0 and count == numB) with no special-case branches.
    for (int i = tx; i < numA; i += stride) {
        int lo = 0, hi = numB;
        while (lo < hi) {
            int mid = (lo + hi) / 2;  // lo+hi <= numB fits in int; no overflow here
            if (tempB[mid] < tempA[i]) lo = mid + 1;
            else                       hi = mid;
        }
        C[i + lo] = tempA[i];
    }

    // Rank of B[i] = i + (count of A elements less than OR EQUAL to B[i]).
    // '<=' places B after equal A elements (upper-bound search), the exact
    // complement of the rule above, so ties never collide in C.
    for (int i = tx; i < numB; i += stride) {
        int lo = 0, hi = numA;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (tempA[mid] <= tempB[i]) lo = mid + 1;
            else                        hi = mid;
        }
        C[i + lo] = tempB[i];
    }
}