python - PyCUDA blocks and grids for working with large data

Question

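I need help understanding how to size my blocks and grid. I am building a Python application that computes scipy-style distance metrics: Euclidean, Manhattan, Pearson, cosine, among others.

The project is PycudaDistances.

It seems to work for small arrays. Unfortunately, when I ran more exhaustive tests, it stopped working. I downloaded the MovieLens dataset (http://www.grouplens.org/node/73).

Using MovieLens 100k, I declared an array of shape (943, 1682); that is, 943 users rating 1682 films. For films a user did not rate, I set the value to 0.

With the larger array the algorithm no longer works. I get the following error:

pycuda._driver.LogicError: cuFuncSetBlockShape failed: invalid value.

Researching this error, I found an explanation from Andrew saying that a block supports up to 512 threads and that, to work with larger sizes, you have to use blocks and grids.

I would like help adapting the Euclidean distance algorithm so that it works for arrays from small to large.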

def euclidean_distances(X, Y=None, inverse=True):
    X, Y = check_pairwise_arrays(X,Y)
    rows = X.shape[0]
    cols = Y.shape[0]
    solution = numpy.zeros((rows, cols))
    solution = solution.astype(numpy.float32)

    kernel_code_template = """
    #include <math.h>
    
    __global__ void euclidean(float *x, float *y, float *solution) {

        int idx = threadIdx.x + blockDim.x * blockIdx.x;
        int idy = threadIdx.y + blockDim.y * blockIdx.y;
        
        float result = 0.0;
        
        for(int iter = 0; iter < %(NDIM)s; iter++) {
            
            float x_e = x[%(NDIM)s * idy + iter];
            float y_e = y[%(NDIM)s * idx + iter];
            result += pow((x_e - y_e), 2);
        }
        int pos = idx + %(NCOLS)s * idy;
        solution[pos] = sqrt(result);
    }
    """
    kernel_code = kernel_code_template % {
        'NCOLS': cols,
        'NDIM': X.shape[1]
    }

    mod = SourceModule(kernel_code)

    func = mod.get_function("euclidean")
    func(drv.In(X), drv.In(Y), drv.Out(solution), block=(cols, rows, 1))

    return numpy.divide(1.0, (1.0 + solution)) if inverse else solution

For more details, see: https://github.com/vinigracindo/pycudaDistances/blob/master/distances.py

score 16 · Accepted Answer

To size the execution parameters for your kernel you need to do two things (in this order):

1. Determine the block size

Your block size will mostly be determined by hardware limitations and performance. I recommend reading this answer for more detailed information, but the very short summary is that your GPU has a limit on the total number of threads per block it can run, and it has a finite register file and a finite amount of shared and local memory. The block dimensions you select must fall inside these limits, otherwise the kernel will not run. The block size can also affect the performance of the kernel, and you will usually find one block size that gives the best performance. The number of threads per block should always be a round multiple of the warp size, which is 32 on all CUDA compatible hardware released to date.
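As a rough illustration of the thread-per-block limit, here is a minimal pure-Python sketch that checks a candidate block shape before launching. The helper name `check_block` is hypothetical, and the default limit of 512 is simply the figure quoted in the question; the real value depends on the device (PyCUDA exposes it through the `MAX_THREADS_PER_BLOCK` device attribute). The sketch also assumes that `Y` defaults to `X` in the question's code, so that `block=(cols, rows, 1)` becomes `(943, 943, 1)`.

```python
def check_block(block, max_threads_per_block=512, warp_size=32):
    """Summarize whether a block shape respects the per-block thread limit.

    max_threads_per_block is an assumed value taken from the question;
    query the actual device for the real limit.
    """
    threads = block[0] * block[1] * block[2]
    return {
        "threads": threads,
        "fits": threads <= max_threads_per_block,
        "warp_multiple": threads % warp_size == 0,
    }


# The launch from the question: one thread per output element in one block.
too_big = check_block((943, 943, 1))
# A 16x16 block, as suggested in this answer.
ok = check_block((16, 16, 1))

print(too_big)
print(ok)
```

A block of 943 x 943 threads is several orders of magnitude over any such limit, which is consistent with the `cuFuncSetBlockShape` failure reported in the question.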

2. Determine the grid size

For the sort of kernel you have shown, the number of blocks you need is directly related to the amount of input data and the dimensions of each block.

If, for example, your input array size was 943x1682, and you had a 16x16 block size, you would need a 59 x 106 grid, which would yield 944x1696 threads in the kernel launch. Because the input data size in this case is not a round multiple of the block size, you will need to modify your kernel to ensure that it doesn't read out-of-bounds. One approach could be something like:

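__global__ void euclidean(float *x, float *y, float *solution) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    int idy = threadIdx.y + blockDim.y * blockIdx.y;

    // idx selects a row of y and idy a row of x, so idy is checked against
    // the row count of x. NROWS is an assumed extra template key
    // (X.shape[0]) that must be added alongside NCOLS and NDIM.
    if ( ( idx < %(NCOLS)s ) && ( idy < %(NROWS)s ) ) {

        .....
    }
}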
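To see what the bounds guard does without a GPU, here is a small NumPy sketch that emulates the launch on the CPU, with one loop iteration standing in for each thread. It is only an illustration: the function name `emulate_guarded_launch` and the tiny 4x4 block are hypothetical, and it assumes the guard compares `idy` against the number of rows of `X` and `idx` against the number of rows of `Y`.

```python
import numpy as np


def emulate_guarded_launch(X, Y, block=(4, 4)):
    """CPU stand-in for the guarded kernel: one loop iteration per GPU thread."""
    rows, cols = X.shape[0], Y.shape[0]
    # Round the grid up so every output element is covered by some thread.
    grid = (-(-cols // block[0]), -(-rows // block[1]))
    solution = np.zeros((rows, cols), dtype=np.float32)
    skipped = 0
    for idy in range(grid[1] * block[1]):       # global thread index along y
        for idx in range(grid[0] * block[0]):   # global thread index along x
            if idx < cols and idy < rows:       # the bounds guard
                diff = X[idy] - Y[idx]
                solution[idy, idx] = np.sqrt(np.sum(diff * diff))
            else:
                skipped += 1                    # padding thread: does nothing
    return solution, skipped


# 5 rows in X and 3 rows in Y, neither a multiple of the 4x4 block.
X_demo = np.arange(15, dtype=np.float32).reshape(5, 3)
Y_demo = X_demo[:3]
solution_demo, skipped_demo = emulate_guarded_launch(X_demo, Y_demo)
print(solution_demo.shape, skipped_demo)
```

Here the rounded-up launch covers 4 x 8 = 32 thread positions for only 15 outputs, so 17 threads fall outside the data and must skip the body rather than read or write past the arrays.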

The Python code to launch the kernel could look something like:

bdim = (16, 16, 1)
dx, mx = divmod(cols, bdim[0])
dy, my = divmod(rows, bdim[1])

gdim = ( (dx + (mx>0)) * bdim[0], (dy + (my>0)) * bdim[1]) )
func(drv.In(X), drv.In(Y), drv.Out(solution), block=bdim, grid=gdim)

This question and answer may also help understand how this process works.

Please note that all of the above code was written in the browser and has never been tested. Use it at your own risk.

Also note it was based on a very brief reading of your code and might not be correct because you have not really described anything about how the code is called in your question.

score 2 · Answer

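The accepted answer is correct in principle, but the code talonmies listed is not quite right. This line: gdim = ( (dx + (mx>0)) * bdim[0], (dy + (my>0)) * bdim[1]) ) should be: gdim = ( (dx + (mx>0)), (dy + (my>0)) ). Apart from the obvious extra parenthesis, the original gdim would launch far more threads than you want. talonmies already explained in his post that the number of threads is blockSize * gridSize. The gdim he listed, however, gives you the total number of threads rather than the grid size you actually need.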
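The difference between the two expressions can be checked with plain arithmetic. The sketch below is only illustrative: the helper name `grid_dims` is hypothetical, and the extents 1682 and 943 are the example figures from the accepted answer, assigned to x and y here purely for demonstration.

```python
def grid_dims(cols, rows, bdim=(16, 16, 1)):
    """Number of blocks along x and y needed to cover cols x rows outputs."""
    dx, mx = divmod(cols, bdim[0])
    dy, my = divmod(rows, bdim[1])
    # Add one extra block along an axis only when there is a remainder.
    return (dx + (mx > 0), dy + (my > 0))


bdim = (16, 16, 1)
gdim = grid_dims(cols=1682, rows=943, bdim=bdim)

# Threads actually launched with the corrected grid.
threads = (gdim[0] * bdim[0], gdim[1] * bdim[1])

# The uncorrected expression multiplies by the block size a second time,
# so the "grid" it produces is really the thread count.
wrong_gdim = (gdim[0] * bdim[0], gdim[1] * bdim[1])
wrong_threads = (wrong_gdim[0] * bdim[0], wrong_gdim[1] * bdim[1])

print(gdim, threads)              # (106, 59) (1696, 944)
print(wrong_gdim, wrong_threads)  # (1696, 944) (27136, 15104)
```

With the corrected grid, the launch would then pass `block=bdim, grid=gdim` to the kernel function, as in the accepted answer.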
