sorting - 如何在 CUDA 上对数组进行部分排序？

Question

问题

假设我有两个数组：

   const int N = 1000000;
   float A[N];
   myStruct *B[N];

A 中的数字可以是正数或负数（例如A[N]={3,2,-1,0,5,-2}），如何在 GPU 上使数组 A部分排序（首先是所有正值，不需要排序，然后是负值）（例如A[N]={3,2,5,0,-1,-2}或）？A[N]={5,2,3,0,-2,-1}数组 B 应该根据 A 改变（A 是键，B 是值）。

由于 A,B 的规模可以很大，所以我认为排序算法应该在 GPU 上实现（尤其是在 CUDA 上，因为我使用这个平台）。我当然知道thrust::sort_by_key可以完成这项工作，但它确实需要额外的工作，因为我不需要对数组 A&B 进行完全排序。

有没有人遇到过这种问题？

推力示例

thrust::sort_by_key(thrust::device_ptr<float> (A), 
            thrust::device_ptr<float> ( A + N ),  
            thrust::device_ptr<myStruct> ( B ),  
            thrust::greater<float>() );

score 1 · Accepted Answer

Thrust 在 Github 上的文档不是最新的。正如@JaredHoberock 所说，thrust::partition这是要走的路，因为它现在支持 stencils。您可能需要从Github 存储库获取副本：

git clone git://github.com/thrust/thrust.git

然后scons doc在 Thrust 文件夹中运行以获取更新的文档，并在编译代码时使用这些更新的 Thrust 源 ( nvcc -I/path/to/thrust ...)。使用新的模板分区，您可以：

#include <thrust/partition.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

struct is_positive
{
__host__ __device__
bool operator()(const int &x)
{
  return x >= 0;
}
};


thrust::partition(thrust::host, // if you want to test on the host
                  thrust::make_zip_iterator(thrust::make_tuple(keyVec.begin(), valVec.begin())),
                  thrust::make_zip_iterator(thrust::make_tuple(keyVec.end(), valVec.end())),
                  keyVec.begin(),
                  is_positive());

这将返回：

Before:
  keyVec =   0  -1   2  -3   4  -5   6  -7   8  -9
  valVec =   0   1   2   3   4   5   6   7   8   9
After:
  keyVec =   0   2   4   6   8  -5  -3  -7  -1  -9
  valVec =   0   2   4   6   8   5   3   7   1   9

请注意，这 2 个分区不一定是排序的。此外，原始向量和分区之间的顺序可能不同。如果这对您很重要，您可以使用thrust::stable_partition：

stable_partition 与 partition 的不同之处在于 stable_partition 保证保持相对顺序。也就是说，如果 x 和 y 是 [first, last) 中的元素，使得 pred(x) == pred(y)，并且如果 x 在 y 之前，那么在 stable_partition 之后 x 在 y 之前仍然为真。

如果你想要一个完整的例子，这里是：

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/partition.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

struct is_positive
{
__host__ __device__
bool operator()(const int &x)
{
  return x >= 0;
}
};

void print_vec(const thrust::host_vector<int>& v)
{
  for(size_t i = 0; i < v.size(); i++)
    std::cout << "  " << v[i];
  std::cout << "\n";
}

int main ()
{
  const int N = 10;

  thrust::host_vector<int> keyVec(N);
  thrust::host_vector<int> valVec(N);

  int sign = 1;
  for(int i = 0; i < N; ++i)
  {
    keyVec[i] = sign * i;
    valVec[i] = i;
    sign *= -1;
  }

  // Copy host to device
  thrust::device_vector<int> d_keyVec = keyVec;
  thrust::device_vector<int> d_valVec = valVec;

  std::cout << "Before:\n  keyVec = ";
  print_vec(keyVec);
  std::cout << "  valVec = ";
  print_vec(valVec);

  // Partition key-val on device
  thrust::partition(thrust::make_zip_iterator(thrust::make_tuple(d_keyVec.begin(), d_valVec.begin())),
                    thrust::make_zip_iterator(thrust::make_tuple(d_keyVec.end(), d_valVec.end())),
                    d_keyVec.begin(),
                    is_positive());

  // Copy result back to host
  keyVec = d_keyVec;
  valVec = d_valVec;

  std::cout << "After:\n  keyVec = ";
  print_vec(keyVec);
  std::cout << "  valVec = ";
  print_vec(valVec);
}

更新

thrust::sort_by_key我与版本进行了快速比较，thrust::partition实现似乎确实更快（这是我们自然可以期待的）。这是我在 NVIDIA Visual Profiler 上获得的内容，左侧N = 1024 * 1024是排序版本，右侧是分区版本。您可能希望自己进行相同类型的测试。

排序与分区

score 0 · Accepted Answer

您应该能够通过修改比较运算符来实现这一目标：

struct my_compare
{
  __device__ __host__ bool operator()(const float x, const float y) const
  {
    return !((x<0.0f) && (y>0.0f));
  }
};


thrust::sort_by_key(thrust::device_ptr<float> (A), 
            thrust::device_ptr<float> ( A + N ),  
            thrust::device_ptr<myStruct> ( B ),  
            my_compare() );

score 0 · Accepted Answer

这个怎么样？：

计算多少个正数来确定拐点
将拐点的每一侧平均分成组（负组的长度相同，但与正组的长度不同。这些组是结果的记忆块）
每个块对使用一个内核调用（一个线程）
每个内核将输入组中任何不合适的元素交换到所需的输出组中。您将需要标记交换次数超过最大值的任何块，以便您可以在后续迭代中修复它们。
重复直到完成

内存流量仅是交换（从原始元素位置到排序位置）。我不知道这个算法是否听起来像任何已经定义的东西......

sorting - 如何在 CUDA 上对数组进行部分排序？

问题

推力示例

3 回答 3

更新

Related

Reference