cuda - 使用 CUDA 压缩“稀疏数据”（CCL：连接组件标签缩减）

Question

我有一个 500 万个 32 位整数列表（实际上是 2048 x 2560 图像），其中 90% 是零。非零单元是完全不连续或以任何方式连续的标签（例如 2049、8195、1334300、34320923、4320932）（它是我们自定义连接组件标签 CCL 算法的输出）。我正在使用 NVIDA Tesla K40，所以如果它需要任何前缀扫描工作，我会喜欢它，它使用 SHUFFLE、BALLOT 或任何更高的 CC 功能。

我不需要一个完整的例子，只是一些建议。

为了说明这一点，这里有一篇由我们的 CCL 算法标记的博客。

标记的 Blob

其他 blob 将具有不同的唯一标签（例如 13282）。但是所有都将被零包围，并且是椭圆形的。（我们针对椭圆体优化了 CCL，这就是我们不使用这些库的原因）。但一个副作用是 blob 标签不是连续的数字。我们不关心它们的编号顺序，但我们想要一个标记为#1，另一个标记为#2，最后一个标记为#n，其中n 是图像中的斑点数。

我是什么意思标签#1？我的意思是所有 2242 个单元格都应该用 1 替换。所有 13282 个单元格都应该是 #2，等等。

我们 CCL 的最大 blob 数等于 2048x2560。所以我们知道数组的大小。

实际上，罗伯特·克罗维拉（Robert Crovella）在一天前已经对此给出了很好的答案。这并不准确，但我现在看到了如何应用答案。所以我不需要更多的帮助。但他在时间和精力上非常慷慨，并要求我用示例重新编写问题，所以我这样做了。

score 4 · Accepted Answer

一种可能的方法是使用以下序列：

thrust::transform- 将输入数据全部转换为 1 或 0：

0 27 42  0 18 99 94 91  0  -- input data
0  1  1  0  1  1  1  1  0  -- this will be our "mask vector"

thrust::inclusive_scan- 将掩码向量转换为渐进序列：

0  1  1  0  1  1  1  1  0  -- "mask" vector
0  1  2  2  3  4  5  6  6  -- "sequence" vector

另一个thrust::transform掩盖非增加值的方法：

0  1  1  0  1  1  1  1  0  -- "mask" vector
0  1  2  2  3  4  5  6  6  -- "sequence" vector
-------------------------
0  1  2  0  3  4  5  6  0  -- result of "AND" operation

请注意，我们可以将前两个步骤与thrust::transform_inclusive_scana 结合起来，然后thrust::transform使用稍有不同的变换函子执行第三步。这种修改允许我们省去创建临时“掩码”向量。

这是一个完整的示例，显示了使用“修改”的方法thrust::transform_inclusive_scan：

$ cat t635.cu
#include <iostream>
#include <stdlib.h>

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <thrust/transform_scan.h>
#include <thrust/generate.h>
#include <thrust/copy.h>


#define DSIZE 20
#define PCT_ZERO 40

struct my_unary_op
{
  __host__ __device__
  int operator()(const int data) const
  {
    return (!data) ?  0:1;}
};

struct my_binary_op
{
  __host__ __device__
  int operator()(const int d1, const int d2) const
  {
    return (!d1) ? 0:d2;}
};

int main(){

// generate DSIZE random 32-bit integers, PCT_ZERO% are zero
  thrust::host_vector<int> h_data(DSIZE);
  thrust::generate(h_data.begin(), h_data.end(), rand);
  for (int i = 0; i < DSIZE; i++)
    if ((rand()%100)< PCT_ZERO) h_data[i] = 0;
    else h_data[i] %= 1000;
  thrust::device_vector<int> d_data = h_data;
  thrust::device_vector<int> d_result(DSIZE);
  thrust::transform_inclusive_scan(d_data.begin(), d_data.end(), d_result.begin(), my_unary_op(), thrust::plus<int>());
  thrust::transform(d_data.begin(), d_data.end(), d_result.begin(), d_result.begin(), my_binary_op());
  thrust::copy(d_data.begin(), d_data.end(), std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
  thrust::copy(d_result.begin(), d_result.end(), std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
  return 0;
}

$ nvcc -o t635 t635.cu
$ ./t635
0,886,777,0,793,0,386,0,649,0,0,0,0,59,763,926,540,426,0,736,
0,1,2,0,3,0,4,0,5,0,0,0,0,6,7,8,9,10,0,11,
$

在我看来，响应更新，这些新信息使问题更难以解决。直方图技术浮现在脑海中，但对 32 位整数（标签）的占用范围没有任何限制，或者对特定标签在数据集中可能重复的次数没有任何限制，直方图技术似乎不切实际。这使我考虑对数据进行排序。

像这样的方法应该有效：

用于thrust::sort对数据进行排序。
用于thrust::unique删除重复项。
删除重复的排序数据现在为我们提供了输出集 [0,1,2, ...] 的排序。让我们称之为我们的“地图”。我们可以使用并行二进制搜索技术将原始数据集中的每个标签转换为其映射的输出值。

这个过程对我来说似乎相当“昂贵”。我建议重新考虑上游标记操作，看看是否可以重新设计它以生成更适合高效下游处理的数据集。

无论如何，这是一个完整的示例：

$ cat t635.cu
#include <iostream>
#include <stdlib.h>

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/unique.h>
#include <thrust/copy.h>


#define DSIZE 20
#define PCT_ZERO 40
#define RNG 10

#define nTPB 256

// sets idx to the index of the first element in a that is
// equal to or larger than key

__device__ void bsearch_range(const int *a, const int key, const unsigned len_a, unsigned *idx){
  unsigned lower = 0;
  unsigned upper = len_a;
  unsigned midpt;
  while (lower < upper){
    midpt = (lower + upper)>>1;
    if (a[midpt] < key) lower = midpt +1;
    else upper = midpt;
    }
  *idx = lower;
  return;
  }

__global__ void find_my_idx(const int *a, const unsigned len_a,  int *my_data, int *my_idx, const unsigned len_data){
  unsigned idx = (blockDim.x * blockIdx.x) + threadIdx.x;
  if (idx < len_data){
    unsigned sp_a;
    int val = my_data[idx];
    bsearch_range(a, val, len_a, &sp_a);
    my_idx[idx] = sp_a;
    }
}


int main(){

// generate DSIZE random 32-bit integers, PCT_ZERO% are zero
  thrust::host_vector<int> h_data(DSIZE);
  thrust::generate(h_data.begin(), h_data.end(), rand);
  for (int i = 0; i < DSIZE; i++)
    if ((rand()%100)< PCT_ZERO) h_data[i] = 0;
    else h_data[i] %= RNG;
  thrust::device_vector<int> d_data = h_data;
  thrust::device_vector<int> d_result = d_data;
  thrust::sort(d_result.begin(), d_result.end());
  thrust::device_vector<int> d_unique = d_result;
  int unique_size = thrust::unique(d_unique.begin(), d_unique.end()) - d_unique.begin();
  find_my_idx<<< (DSIZE+nTPB-1)/nTPB , nTPB >>>(thrust::raw_pointer_cast(d_unique.data()), unique_size, thrust::raw_pointer_cast(d_data.data()), thrust::raw_pointer_cast(d_result.data()), DSIZE);

  thrust::copy(d_data.begin(), d_data.end(), std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
  thrust::copy(d_result.begin(), d_result.end(), std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
  return 0;
}
$ nvcc t635.cu -o t635
$ ./t635
0,6,7,0,3,0,6,0,9,0,0,0,0,9,3,6,0,6,0,6,
0,2,3,0,1,0,2,0,4,0,0,0,0,4,1,2,0,2,0,2,
$

score 0 · Accepted Answer

我的答案类似于@RobertCrovella 给出的答案，但我认为使用它thrust::lower_bound而不是自定义二进制搜索更简单。（既然是纯推力，后端可以互换）

复制输入数据
对复制的数据进行排序
从排序的数据创建一个唯一的列表
在唯一列表中找到每个输入的下限

我在下面包含了一个完整的示例。有趣的是，通过预先挂起排序步骤并再次调用thrust::unique. 根据输入数据，这可以显着减少排序中的元素数量，这是这里的瓶颈。

#include <iostream>
#include <stdlib.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/unique.h>
#include <thrust/binary_search.h>
#include <thrust/copy.h>

int main()
{
  const int ndata = 20;
  // Generate host input data
  thrust::host_vector<int> h_data(ndata);
  thrust::generate(h_data.begin(), h_data.end(), rand);
  for (int i = 0; i < ndata; i++)
  {
    if ((rand() % 100) < 40)
      h_data[i] = 0;
    else
      h_data[i] %= 10;
  }

  // Copy data to the device
  thrust::device_vector<int> d_data = h_data;
  // Make a second copy of the data
  thrust::device_vector<int> d_result = d_data;
  // Sort the data copy
  thrust::sort(d_result.begin(), d_result.end());
  // Allocate an array to store unique values
  thrust::device_vector<int> d_unique = d_result;
  {
    // Compress all duplicates
    const auto end = thrust::unique(d_unique.begin(), d_unique.end());
    // Search for all original labels, in this compressed range, and write their
    // indices back as the result
    thrust::lower_bound(
      d_unique.begin(), end, d_data.begin(), d_data.end(), d_result.begin());
  }

  thrust::copy(
    d_data.begin(), d_data.end(), std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
  thrust::copy(d_result.begin(),
               d_result.end(),
               std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
  return 0;
}

cuda - 使用 CUDA 压缩“稀疏数据”（CCL：连接组件标签缩减）

2 回答 2

Related

Reference