c++ - 一种从索引表中进行保序选择的并行算法

Question

从索引表中保持顺序的选择在串行代码中是微不足道的，但在多线程中就不那么简单了，特别是如果想要通过避免链表来保持效率（多线程的全部意义）。考虑串行代码

template<typename T>
std::vector<T> select_in_order(
  std::vector<std::size_t> const&keys, // permutation of 0 ... key.size()-1
  std::vector<T> const&data)           // anything copyable
{ // select data[keys[i]] allowing keys.size() >= data.size()
  std::vector<T> result;
  for(auto key:keys)
    if(key<data.size())
      result.push_back(data[key]);
  return result;
}

我怎么能做这个多线程（比如使用 TBB 甚至 OpenMP），特别是如果data.size() < key.size()？

score 3 · Accepted Answer

您正在寻找的并行计算操作称为Stream Compaction。

流压缩

它可以有效地并行实现，尽管算法并不简单。您最好的选择是使用已经实现它的库，例如Thrust。但是，如果您真的想自己实现，可以在GPU 编程第 39.3.1 章或Udacity 的并行编程简介课程第 4.5 课中找到对该算法的解释。

本质上，它涉及为您的数组定义一个布尔谓词（在您的示例中key<data.size()），将其映射到一个单独的数组，对谓词数组进行Scan，然后执行Scatter。

Map()并且Scatter()易于并行实现；的实现Scan()是不平凡的部分。大多数并行库都会有一个Scan()实现；如果不是，上面的链接都描述了几种并行扫描算法。

这一切都假设你有很多核心，比如在 GPU 上。在 CPU 上，串行执行可能会更快；或者将数组分成大块，串行处理块（在不同的内核上并行），然后将结果合并在一起。哪种方法最好取决于您的数据（如果大多数键都在最终数组中，则前者效果更好）。

score 2 · Accepted Answer

Partition your keys among your threads, e.g. with N threads you would give T1 the keys {0, key.size() / N - 1}, T2 gets the keys {key.size() / N, 2 * key.size() / N - 1}, etc., and TN gets the keys {(N - 1) / N * keys.size(), keys.size() - 1}. Each thread puts its results in a thread-local container, and you merge the containers when the threads have finished. This way you don't need to perform any synchronization between the threads.

The most efficient way to merge the containers is for the containers to be linked lists, since it's easy to append T2's list to T1's list and so on. However, as you said it's a good idea to avoid linked lists since they don't parallelize well.

Another option is to have each thread store its results in a thead-local array, and to then merge these arrays when the threads have completed; you can perform this merge in parallel (the size of each thread's results is T{N}results.size(); given the final merge array final_results, T1 merges its data to final_results[0, T1results.size()-1], T2 merges its data to final_results[T1results.size(), T1results.size() + T2results.size()-1], T3 merges its results to final_results[T1results.size() + T2results.size(), T1results.size() + T2results.size + T3results.size()-1], etc).

Another option is to use a shared concurrent_hash_map from TBB, with key as the key and data[key] as the value.

c++ - 一种从索引表中进行保序选择的并行算法

2 回答 2

Related

Reference