parallel-processing - CUDA and parallel search

Question

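I want to do the following: suppose I have a sorted vector of numbers of size N (N is fairly large) and a number x. I want to search this vector in parallel for the correct position of x. For example:

myVector = [ 1, 2, 3, .... , 10000] and x = 3.2,

Then I have to return 3. The first thread to find the correct position should interrupt the work of the other threads, so the time taken is minimized: t = min(t_1, t_2, ..., t_{number of threads}). Do you think using multiple threads to find the correct position would be faster? What about communication between threads? Once a value has been read by one thread and found not to match, the other threads should skip that value during the search (perhaps via a boolean flag that has to be updated).

Do you have any suggestions for this algorithm?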

score 0 · Accepted Answer

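No communication between threads or blocks is needed. Each thread checks whether the value at its current index is greater than or equal to the query value. If so, it returns. Most threads will exit at this check.

That leaves only the threads whose value is less than the query. Each of them checks whether the next value is greater than or equal to the query and, if so, writes the appropriate index and returns.

Here is an untested kernel that I wrote at 5 in the morning.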

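template<typename ty>
__global__ static void search(int *out, ty *list, ty val, int n)
{
    int start = threadIdx.x + blockIdx.x * blockDim.x;
    // Grid-stride loop: each thread visits start, start + stride, ...
    // If list[0] >= val no thread writes, so the caller should preset *out.
    for (int idx = start; idx < n; idx += gridDim.x * blockDim.x) {
        // The list is sorted, so later indices of this thread fail as well.
        if (list[idx] >= val) return;
        // Last element is below val: report it without reading list[n].
        if (idx == n - 1) { *out = idx; return; }
        ty next = list[idx + 1];
        if (next >= val) {
            *out = next == val ? (idx + 1) : idx;
            return;
        }
    }
}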

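That said, you really don't want to do this. On a CPU, a binary search gives O(log n) worst-case performance, which means a billion elements can be searched in about 30 steps (log2 of 10^9 is roughly 29.9). Unless the data already lives on the GPU and you want to avoid a memory copy, this is much better done on the CPU.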
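As a rough illustration of the CPU alternative mentioned above, here is a minimal sketch using `std::upper_bound` from the C++ standard library. The function name `find_position` is hypothetical and only used for this example. It assumes the vector is sorted in ascending order and returns the 0-based index of the last element that is less than or equal to x, which for a list of unique values is the same index the kernel above is meant to report.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper for this sketch (not part of the original answer).
// Returns the 0-based index of the last element <= x in the ascending-sorted
// vector v, or -1 if every element is greater than x (or v is empty).
long find_position(const std::vector<double>& v, double x)
{
    // upper_bound performs a binary search and returns an iterator to the
    // first element strictly greater than x (v.end() if there is none).
    auto it = std::upper_bound(v.begin(), v.end(), x);
    return static_cast<long>(it - v.begin()) - 1;
}
```

For the question's example (values 1 to 10000, x = 3.2) this returns index 2, the slot holding the value 3, after roughly 14 comparisons.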

score 0 · Accepted Answer

Some time ago I wrote the following code to do something similar:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <stdlib.h>

__global__ void fast_finder(unsigned int *g_found, float x, float *y)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int pos = (unsigned int)(x == y[i]);  // 1 on a match, 0 otherwise
    // Thread 0 must not store its "no match" 0 into g_found[0]: that store
    // would race with the matching thread, which also writes g_found[0].
    if (i == 0 && pos == 0) return;
    // Match: g_found[0] = i. No match: g_found[i] = 0.
    g_found[i * (1 - pos)] = i * pos;
}

int main(int argc, char *argv[])
{
    int N = 65536;
    unsigned int h_found, *d_found;
    float *h_y = (float *)malloc(N * sizeof(float)), *d_y, x = 5.0f;
    int nThreads = 1024, nBlocks = N / nThreads;  // N is a multiple of nThreads

    // Unique values in descending order: N-1, N-2, ..., 0.
    for (int i = 0; i < N; ++i) h_y[i] = (float)(N - i - 1);

    if (x != h_y[0]) {
        cudaSetDevice(0);
        cudaMalloc((void **)&d_found, N * sizeof(unsigned int));
        cudaMalloc((void **)&d_y, N * sizeof(float));
        cudaMemcpy(d_y, h_y, N * sizeof(float), cudaMemcpyHostToDevice);
        // g_found[0] == 0 means "not found", so clear it before the launch.
        cudaMemset(d_found, 0, sizeof(unsigned int));

        fast_finder<<<nBlocks, nThreads>>>(d_found, x, d_y);
        cudaDeviceSynchronize();  // replaces the deprecated cudaThreadSynchronize()

        cudaMemcpy(&h_found, d_found, sizeof(unsigned int), cudaMemcpyDeviceToHost);
        if (h_found) printf("%g found on %u. position!\n", x, h_found);
        else printf("%g not found!\n", x);

        cudaFree(d_y);
        cudaFree(d_found);

    } else printf("%g found on the first position!\n", x);

    free(h_y);

    getchar();
    return EXIT_SUCCESS;
}

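Here each thread checks whether the value of y at its global thread index equals x. If it does, the thread writes its index to the first position of the g_found array; otherwise it writes 0 to g_found at the position given by its own index. For a y of length 16 whose 11th position (index 10) holds the value 5, the output looks like this: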

g_found = { 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }

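In this case y does not need to be sorted, but it must contain only unique values. The device part of this code can easily be changed to find the index at which x would be inserted, as follows: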

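__global__ void fast_finder(unsigned int *g_found, float x, float *y)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Assumes y is sorted ascending and has at least one element more than
    // there are threads, so that y[i + 1] stays in bounds.
    // The two tests must be joined with && (not ||): x belongs right after
    // y[i] only if y[i] <= x < y[i + 1]. The strict upper bound keeps two
    // neighbouring threads from both matching when x equals an element.
    unsigned int pos = (unsigned int)(x >= y[i] && x < y[i + 1]);
    // Avoid racing on g_found[0]; requires g_found[0] to be zeroed before launch.
    if (i == 0 && pos == 0) return;
    g_found[i * (1 - pos)] = (i + 1) * pos;
}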

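The output of this version is similar to the one shown above. When g_found at position 0 is 0, the value x is not present in the array y. Whether the first element of y equals x is checked by the host code, before the kernel is even called. Changing this part to apply whatever condition you want is not a problem either.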

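As you can see, in a solution like this all threads work together, and there is no need to terminate execution as soon as x is found. A packet (chunked) search can also be applied, which means assigning each thread a small part of y to search, thus allowing y to be much larger.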
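The packet search idea can be sketched on the host in plain C++. This is not the answerer's code: `scan_chunk` and `chunked_find` are hypothetical names, and the loop over `tid` merely stands in for a kernel launch in which each CUDA thread would run the body of `scan_chunk` on its own slice. As in the program above, y is assumed to contain unique values, so at most one slice reports a hit.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Work of one hypothetical "thread": scan the slice
// [tid * chunk, min((tid + 1) * chunk, y.size())) and return the index of x,
// or -1 if x does not occur in that slice.
long scan_chunk(const std::vector<float>& y, float x,
                std::size_t tid, std::size_t chunk)
{
    std::size_t begin = tid * chunk;
    std::size_t end = std::min(begin + chunk, y.size());
    for (std::size_t i = begin; i < end; ++i)
        if (y[i] == x) return static_cast<long>(i);
    return -1;
}

// Host-side stand-in for the kernel launch: run every "thread" in turn.
// Returns the index of x in y, or -1 if x is absent (or chunk is 0).
long chunked_find(const std::vector<float>& y, float x, std::size_t chunk)
{
    if (chunk == 0) return -1;
    std::size_t n_threads = (y.size() + chunk - 1) / chunk;  // ceil(n / chunk)
    long found = -1;
    for (std::size_t tid = 0; tid < n_threads; ++tid) {
        long hit = scan_chunk(y, x, tid, chunk);
        if (hit >= 0) found = hit;  // unique values: at most one slice hits
    }
    return found;
}
```

With a chunk size of 64, the 65536-element array from the program above would need only 1024 threads instead of 65536, which is the sense in which chunking lets y grow larger for a fixed launch size.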
