algorithm - How would you implement this function in CUDA? (offsets in sorted integer vector)

Question

I have a sorted array of integers on the device, e.g.:

[0,0,0,1,1,2,2]

I would like another array holding the offset of the first occurrence of each distinct element:

[0,3,5]

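(since the first 0 is at position 0, the first 1 is at position 3, and so on). I know beforehand how many distinct elements there will be. How would you implement this efficiently in CUDA? I'm not asking for code, but for a high-level description of the algorithm you would implement to compute this transformation. I have already looked at the various functions in the thrust namespace, but could not think of any combination of Thrust functions that achieves this. Also, does this transformation have a widely accepted name?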

score 4 · Accepted Answer

Although I have never used the Thrust library, what about this possible approach (simple but maybe effective):

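// input:  your sorted array of n integers (device memory)
// offset: n entries, initialized with -1 before the launch (device memory)
__global__ void mark_offsets(const int *input, int *offset, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x; // each thread checks one index position
    if (id >= n) return;

    // bingo! a new value begins at index 0 or wherever the element exceeds its predecessor
    if (id == 0 || input[id] > input[id - 1])
    {
        int oid = input[id]; // use the integer value as index (assumes 0 <= value < n)
        offset[oid] = id;    // mark the offset with the beginning of the new value
    }
}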

In your example, the first entries of the output would be:

[0,3,5]

But if the input array is:

[0,0,0,2,2,4,4]

then the output would begin with:

[0,-1,3,-1,5]

Now, if Thrust can do it for you, apply remove_if( offset[i] == -1 ) and compact the array.
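
To make the two steps concrete, here is a minimal sequential sketch in plain C++, not the GPU implementation itself. The function name `mark_and_compact` is made up for illustration, and the code assumes the input is sorted and that every value lies in [0, n). Each loop iteration corresponds to what one thread would do, and `std::remove` stands in for a device-side compaction such as `thrust::remove`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: sequential stand-in for the marking kernel plus compaction.
// Assumes `input` is sorted and every value lies in [0, input.size()).
std::vector<int> mark_and_compact(const std::vector<int>& input)
{
    std::vector<int> offset(input.size(), -1); // -1 means "value not present"

    // Marking step: on the GPU each iteration would be one thread.
    for (std::size_t id = 0; id < input.size(); ++id) {
        assert(input[id] >= 0 && static_cast<std::size_t>(input[id]) < input.size());
        if (id == 0 || input[id] != input[id - 1]) {
            offset[input[id]] = static_cast<int>(id); // value used as index
        }
    }

    // Compaction step: drop the -1 placeholders while keeping the order.
    offset.erase(std::remove(offset.begin(), offset.end(), -1), offset.end());
    return offset;
}
```

With this sketch, both [0,0,0,1,1,2,2] and [0,0,0,2,2,4,4] compact to [0,3,5], since only the positions of the runs matter once the -1 entries are gone.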

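This approach wastes a lot of memory on the offset array, but since you don't know how many offsets you are going to find, the worst case uses as much memory as the input array.

On the other hand, with so few instructions per thread relative to the global memory loads, this implementation will be limited by memory bandwidth. There are some optimizations for this case, such as processing several values per thread.

My 2 cents!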

score 4 · Accepted Answer

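You can solve this in Thrust using thrust::unique_by_key_copy with thrust::counting_iterator. The idea is to pass your integer array as the keys argument to unique_by_key_copy and to use a sequence of ascending integers (i.e., a counting_iterator) as the values. unique_by_key_copy then compacts the values array down to the index of the first occurrence of each unique key: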

#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/unique.h>
#include <thrust/copy.h>
#include <iterator>
#include <iostream>

int main()
{
  thrust::device_vector<int> keys(7);
  keys[0] = 0; keys[1] = 0; keys[2] = 0;
  keys[3] = 1; keys[4] = 1; keys[5] = 2; keys[6] = 2;

  std::cout << "keys before unique_by_key_copy: [ ";
  thrust::copy(keys.begin(), keys.end(), std::ostream_iterator<int>(std::cout," "));
  std::cout << "]" << std::endl;

  thrust::device_vector<int> offsets(3);

  thrust::unique_by_key_copy(keys.begin(), keys.end(),          // keys
                             thrust::make_counting_iterator(0), // [0, 1, 2, 3, ...] are the values
                             thrust::make_discard_iterator(),   // discard the compacted keys
                             offsets.begin());                  // the offsets are the values

  std::cout << "offsets after unique_by_key_copy: [ ";
  thrust::copy(offsets.begin(), offsets.end(), std::ostream_iterator<int>(std::cout," "));
  std::cout << "]" << std::endl;

  return 0;
}

Here's the output:

$ nvcc test.cu -run
keys before unique_by_key_copy: [ 0 0 0 1 1 2 2 ]
offsets after unique_by_key_copy: [ 0 3 5 ]

score 1 · Accepted Answer

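Scan is the algorithm you're looking for. If you don't have an implementation lying around, the Thrust library would be a good resource. (Look for thrust::inclusive_scan and thrust::exclusive_scan.)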

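Scan (or "parallel prefix sum") takes an input array and generates an output where each element is the sum of the inputs up to and including that point: [1 5 3 7] => [1 6 9 16]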
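
For reference, the inclusive scan described above can be checked on the CPU with the standard library. This is only a sequential sketch (the wrapper name `inclusive_scan_cpu` is made up); a GPU version would compute the same result in parallel.

```cpp
#include <numeric>
#include <vector>

// Hypothetical helper: inclusive prefix sum, out[i] = in[0] + in[1] + ... + in[i].
std::vector<int> inclusive_scan_cpu(const std::vector<int>& in)
{
    std::vector<int> out(in.size());
    std::partial_sum(in.begin(), in.end(), out.begin());
    return out;
}
```

Calling it on [1 5 3 7] gives [1 6 9 16], matching the example.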

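If you scan a predicate (0 or 1 depending on an evaluated condition), where the predicate checks whether a given element differs from the preceding element, then you compute the output index of the element in question. For your example array:

[0 0 0 1 1 2 2]
[0 0 0 1 0 1 0] <= predicate
[0 0 0 1 1 2 2] <= scanned predicate

Now you can use the scanned predicate as the index at which to write your output: every element that starts a new run (the first element, plus each element whose predicate is 1) writes its own position into the output slot given by its scanned value.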
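
The three steps (predicate, scan, scatter) can be sketched sequentially as follows. This is an illustration under assumptions rather than a tuned GPU implementation: the name `offsets_by_scan` is made up, and on the device each loop would become a parallel primitive (a transform, a scan, and a conditional scatter).

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: offsets of the first occurrence of each distinct value
// in a sorted array, computed as predicate -> inclusive scan -> scatter.
std::vector<int> offsets_by_scan(const std::vector<int>& input)
{
    const std::size_t n = input.size();
    if (n == 0) return {};

    // Step 1: predicate is 1 where an element differs from its predecessor.
    std::vector<int> pred(n, 0);
    for (std::size_t i = 1; i < n; ++i) {
        pred[i] = (input[i] != input[i - 1]) ? 1 : 0;
    }

    // Step 2: the inclusive scan of the predicate is each element's output slot.
    std::vector<int> slot(n);
    std::partial_sum(pred.begin(), pred.end(), slot.begin());

    // Step 3: scatter. Only run starts write; slot[n - 1] + 1 is the run count.
    std::vector<int> offsets(static_cast<std::size_t>(slot[n - 1]) + 1);
    for (std::size_t i = 0; i < n; ++i) {
        if (i == 0 || pred[i] == 1) {
            offsets[slot[i]] = static_cast<int>(i);
        }
    }
    return offsets;
}
```

Unlike the value-as-index approach, the output here is already dense, so no compaction pass is needed and the values themselves do not have to be small.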

score 0 · Accepted Answer

Good question, and the answer depends on what you need to do with the result afterwards. Let me explain.

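Since this problem can be solved on the CPU in O(n) (where n is the input length), you will suffer from the cost of memory allocation and copying (host -> device for the input and device -> host for the result). This will lead to worse performance than a simple CPU solution.

Even if your array is already in device memory, each computation block needs to read it into local memory or registers (or at least access device memory), and that cannot be done faster than on the CPU.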
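
For comparison, the O(n) CPU solution mentioned above is a single pass over the array. A minimal sketch follows; the function name `offsets_single_pass` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: one linear pass that records where each run of equal
// values begins in a sorted array.
std::vector<int> offsets_single_pass(const std::vector<int>& input)
{
    std::vector<int> offsets;
    for (std::size_t i = 0; i < input.size(); ++i) {
        if (i == 0 || input[i] != input[i - 1]) {
            offsets.push_back(static_cast<int>(i));
        }
    }
    return offsets;
}
```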

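In general, CUDA speeds things up nicely when the following conditions hold:

The asymptotic complexity of the computation is high compared to the input data length. For example, the input length is n and the complexity is O(n^2) or O(n^3).
There is a way to split the task into independent or weakly dependent subtasks.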

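So if I were you, I would not try to do this kind of computation in CUDA if I could avoid it. If it has to be a standalone function, or an output format conversion feeding some other function, I would do it on the CPU.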

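If it is part of some more complex algorithm, then the answer is more complicated. If I were in your place, I would try to change the [0,3,5] format somehow, because it restricts how you can use CUDA's computing power. You cannot efficiently split the task into independent blocks. For example, suppose I process 10 integers in one computation thread and the next 10 integers in another. The second thread does not know where to place its output until the first one has finished. I would probably split the array into subarrays and store the answer for each subarray separately. It depends heavily on the computation you are doing.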
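
One way to read the "answer per subarray" idea is sketched below, sequentially and under assumptions of my own: a fixed chunk size, global positions in the output, and a comparison against the element just before each chunk so that a run spanning a boundary is reported only once. The name `offsets_per_chunk` is made up. Each chunk writes only to its own list, so no chunk has to wait for another to learn where its output goes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout: one offset list per fixed-size chunk of a sorted array.
// Positions are global indices; result[c] holds the run starts inside chunk c.
std::vector<std::vector<int>> offsets_per_chunk(const std::vector<int>& input,
                                                std::size_t chunk)
{
    assert(chunk > 0);
    std::vector<std::vector<int>> result;
    for (std::size_t begin = 0; begin < input.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, input.size());
        std::vector<int> local; // this chunk's private output
        for (std::size_t i = begin; i < end; ++i) {
            // Reads the predecessor from the input only, never another chunk's output.
            if (i == 0 || input[i] != input[i - 1]) {
                local.push_back(static_cast<int>(i));
            }
        }
        result.push_back(local);
    }
    return result;
}
```

For [0,0,0,1,1,2,2] with a chunk size of 3, the chunks produce [0], [3,5], and an empty list; concatenating them recovers [0,3,5].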
