multithreading - (When) are parallel sorts practical, and how do you write an efficient one?

Question

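I'm working on a parallelization library for the D programming language. Now that I'm pretty happy with the basic primitives (parallel foreach, map, reduce, and tasks/futures), I'm starting to think about some higher-level parallel algorithms. One of the more obvious candidates for parallelization is sorting.

My first question is: are parallelized versions of sorting algorithms useful in the real world, or are they mostly academic? If they are useful, where are they useful? I personally would seldom use them in my work, because I usually peg all of my cores at 100% using a much coarser-grained level of parallelism than a single sort() call.

Secondly, quick sort seems almost embarrassingly parallel for large arrays, yet I can't get the near-linear speedups I believe I should be getting. For a quick sort, the only inherently serial part is the first partition. I tried parallelizing a quick sort by sorting the two subarrays in parallel after each partition. In simplified pseudocode: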

// I tweaked this number a bunch.  Anything smaller than this and the
// overhead is larger than the parallelization gains.
const smallestToParallelize = 500;

void quickSort(T)(T[] array) {
    if(array.length < someConstant) {
        insertionSort(array);
        return;
    }

    size_t pivotPosition = partition(array);

    if(array.length >= smallestToParallelize) {
        // Sort left subarray in a task pool thread.
        auto myTask = taskPool.execute(quickSort(array[0..pivotPosition]));
        quickSort(array[pivotPosition + 1..$]);
        myTask.workWait();
    } else {
        // Regular serial quick sort.
        quickSort(array[0..pivotPosition]);
        quickSort(array[pivotPosition + 1..$]);
    }
}

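Even for very large arrays, where the time taken by the first partition is negligible, I can only get about a 30% speedup on a dual core compared to a purely serial version of the algorithm. I'm guessing the bottleneck is shared memory access. Any insight on how to eliminate this bottleneck, or on what else the bottleneck might be?

Edit: My task pool has a fixed number of threads, equal to the number of cores in the system minus 1 (since the main thread also does work). Also, the type of wait I'm using is a work wait: if the task has started but not finished, the thread calling workWait() steals other jobs from the pool and executes them until the one it is waiting on is done. If the task hasn't started, it is completed in the current thread. This means that the waiting isn't inefficient. As long as there is work to be done, all threads are kept busy.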
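
The pseudocode above can be made runnable. What follows is a minimal sketch, assuming Phobos's std.parallelism (task, taskPool.put, workForce) stands in for the taskPool.execute and workWait() calls of the question's own library; as documented, workForce runs an unstarted task in the calling thread and otherwise executes other pool tasks while waiting. The serialCutoff value, the middle-element pivot, and the Lomuto-style partition are illustrative assumptions, not details from the question.

```d
import std.algorithm.mutation : swap;
import std.algorithm.sorting : isSorted;
import std.parallelism : task, taskPool;

enum serialCutoff = 32;            // assumed value for the question's `someConstant`
enum smallestToParallelize = 500;  // value from the question; tune per machine

void insertionSort(T)(T[] a)
{
    foreach (i; 1 .. a.length)
        for (size_t j = i; j > 0 && a[j] < a[j - 1]; --j)
            swap(a[j], a[j - 1]);
}

// Lomuto-style partition around the middle element.
// Returns the final index of the pivot.
size_t partition(T)(T[] a)
{
    swap(a[$ / 2], a[$ - 1]);
    auto pivot = a[$ - 1];
    size_t store = 0;
    foreach (i; 0 .. a.length - 1)
    {
        if (a[i] < pivot)
        {
            swap(a[i], a[store]);
            ++store;
        }
    }
    swap(a[store], a[$ - 1]);
    return store;
}

void quickSort(T)(T[] a)
{
    if (a.length < serialCutoff)
    {
        insertionSort(a);
        return;
    }

    immutable p = partition(a);

    if (a.length >= smallestToParallelize)
    {
        // Submit the left half to the pool, sort the right half here,
        // then help with other queued work until the left half is done.
        auto left = task!quickSort(a[0 .. p]);
        taskPool.put(left);
        quickSort(a[p + 1 .. $]);
        left.workForce();
    }
    else
    {
        // Regular serial quick sort.
        quickSort(a[0 .. p]);
        quickSort(a[p + 1 .. $]);
    }
}

void main()
{
    import std.random : Random, uniform;

    auto rng = Random(42);
    auto data = new int[](100_000);
    foreach (ref x; data)
        x = uniform(0, 1_000_000, rng);

    quickSort(data);
    assert(data.isSorted);
}
```

The two disjoint slices never overlap, so the tasks need no locking; any shortfall in speedup therefore comes from task overhead, the serial partitions near the top of the recursion, or the memory system rather than from synchronization on the data itself.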

score 7 · Accepted Answer

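Keep in mind that I'm not an expert on parallel sorting, and people make research careers out of it, but...

1) Are they useful in the real world?

Of course they are, if you need to sort something expensive (such as strings or worse) and you aren't already pegging all the cores.

Think of UI code where you need to sort a large dynamic list of strings based on context.
Think of something like a Barnes-Hut n-body simulation, where you need to sort the particles.

2) Quicksort looks as though it should give a linear speedup, but it doesn't. The partition step is a sequential bottleneck; you will see this if you profile, and it tends to cap out at 2-3x on a quad core.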
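
A back-of-the-envelope model (an idealization, not a measurement) shows where such a cap comes from. Assume ideal pivots, n elements, p cores with p a power of two, unit cost per element per partition pass, and no scheduling or memory overhead. Recursion level k then has 2^k independent partitions of size n/2^k; while 2^k < p only 2^k cores have work, and afterwards all p cores are busy.

```latex
T_1 = n \log_2 n

T_p = \sum_{k=0}^{\log_2 p - 1} \frac{n}{2^k} \;+\; \frac{n}{p}\left(\log_2 n - \log_2 p\right)
    = 2n\left(1 - \frac{1}{p}\right) + \frac{n}{p}\log_2\frac{n}{p}

S_p = \frac{T_1}{T_p}

% Worked example for n = 2^{20}:
%   p = 2:   T_p = n + 9.5n       = 10.5n    =>  S_2    \approx 1.90
%   p = 4:   T_p = 1.5n + 4.5n    = 6n       =>  S_4    \approx 3.33
%   p = 16:  T_p = 1.875n + n     = 2.875n   =>  S_{16} \approx 6.96
```

Even in this ideal setting the serial partitions near the top of the recursion keep the speedup below p, and the gap widens as cores are added. Task creation, cache misses, and memory bandwidth are ignored by the model and push real numbers lower still.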

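If you want good speedups on a smaller system, you need to make sure your per-task overheads are really small, and ideally you want to avoid running too many threads, i.e. no more than about 2 on a dual core. A thread pool probably isn't the right abstraction.

If you want good speedups on a larger system, you'll need to look at scan-based parallel sorts; there are papers on this. Bitonic sort is also quite easy to parallelize, as is merge sort. A parallel radix sort can also be useful; there is one in the PPL (if you aren't averse to Visual Studio 11).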

score 3 · Accepted Answer

I'm no expert but... here is what I'd look at:

First of all, I've heard that, as a rule of thumb, algorithms that look at small bits of a problem from the start tend to work better as parallel algorithms.

Looking at your implementation, try making the parallel/serial switch go the other way: partition the array and sort in parallel until you have N segments, then go serial. If you are more or less grabbing a new thread for each parallel case, then N should be roughly your core count. On the other hand, if your thread pool is of fixed size and acts as a queue of short-lived delegates, then I'd use N of 2 or more times your core count (so that cores don't sit idle because one partition finished faster).

Other tweaks:

Skip the myTask.wait(); at the local level and instead have a wrapper function that waits on all the tasks.
Make a separate serial implementation of the function that avoids the depth check.

score 1 · Accepted Answer

"My first question is, are parallelized versions of sorting algorithms useful in the real world" - that depends on the size of the data set you are working on in the real world. For small data sets the answer is no. For larger data sets it depends not only on the size of the data set but also on the specific architecture of the system.

One of the limiting factors that will prevent the expected increase in performance is the cache layout of the system. If the data can fit in the L1 cache of a core, then there is little to gain by sorting across multiple cores as you incur the penalty of the L1 cache miss between each iteration of the sorting algorithm.

The same reasoning applies to chips that have multiple L2 caches and to NUMA (non-uniform memory access) architectures. So the more cores you want to distribute the sorting across, the larger the smallestToParallelize constant needs to be.

Another limiting factor, which you identified, is shared memory access, or contention over the memory bus. The memory bus can only satisfy a certain number of memory accesses per second, so additional cores that do essentially nothing but read and write main memory put a lot of stress on the memory system.

The last factor that I should point out is the thread pool itself as it may not be as efficient as you think. Because you have threads that steal and generate work from a shared queue, that queue requires synchronization methods; and depending on how those are implemented, they can cause very long serial sections in your code.

score 1 · Accepted Answer

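I don't know whether answers here are still applicable, or whether my suggestions apply to D.

Anyway ...

Assuming D allows it, there is always the possibility of giving the caches prefetch hints. The core in question requests that data it will need soon (not immediately) be loaded into a certain cache level. In the ideal case, the data will already have been fetched by the time the core starts working on it. More likely, the prefetch will be partly under way, which at least results in fewer wait states than if the data were fetched "cold."

You will still be constrained by the overall cache-to-RAM throughput, so you need to organize the data so that enough of it sits in the core's exclusive caches for the core to spend a fair amount of time there before it has to write updated data back.

Code and data need to be organized around the concept of cache lines (fetch units of 64 bytes each), which are the smallest units in a cache. The upshot is that, for two cores, the work needs to be organized so that the memory system does half as much work per core (assuming 100% scalability) as it did when only one core was working and the work had not been reorganized. For four cores it is a quarter as much, and so on. This is quite a challenge but by no means impossible; it only depends on how imaginative you are in restructuring the work. As always, some solutions cannot be conceived ... until someone does just that!

I don't know how WYSIWYG D is compared to C, which I use, but in general I think the process of developing scalable applications improves with the degree of influence the developer has over the compiler's actual machine code generation. For interpreted languages, the interpreter does so much memory work of its own that you risk being unable to discern improvements from the general "background noise."

I once wrote a multithreaded shellsort that ran 70% faster on two cores than on one, and 100% faster on three cores than on one. Four cores ran slower than three. So I know the dilemmas you face.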

score 0 · Accepted Answer

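I'd like to point you to external sorting[1], which faces similar problems. This class of algorithms is mostly used to cope with large volumes of data, but its main point is that it splits large chunks into smaller, unrelated problems, which therefore run very well in parallel. You "only" need to stitch the partial results together afterwards, which is not quite as parallel (but relatively cheap compared to the actual sorting).

An external merge sort also works well with an unknown number of threads. You just split the workload arbitrarily and hand each chunk of n elements to a thread whenever one is idle, until all work units are done, at which point you can start joining them up.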
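
The chunk-then-merge idea can be sketched in memory as follows. This is a minimal illustration, assuming Phobos's std.parallelism and std.range.chunks; the names chunkedSort and mergeSorted are hypothetical, and the merge phase is kept serial for brevity even though the merges within one round are independent of each other.

```d
import std.algorithm.sorting : isSorted, sort;
import std.parallelism : taskPool;
import std.range : chunks;

// Merges two sorted arrays into a newly allocated sorted array.
T[] mergeSorted(T)(T[] a, T[] b)
{
    auto result = new T[](a.length + b.length);
    size_t i, j, k;
    while (i < a.length && j < b.length)
    {
        if (b[j] < a[i])
            result[k] = b[j++];
        else
            result[k] = a[i++];
        ++k;
    }
    // Copy whichever tail is left over.
    immutable restA = a.length - i;
    result[k .. k + restA] = a[i .. $];
    result[k + restA .. $] = b[j .. $];
    return result;
}

// Sorts fixed-size chunks in parallel, then merges the sorted runs pairwise.
// `chunkSize` must be greater than zero.
T[] chunkedSort(T)(T[] data, size_t chunkSize)
{
    // Phase 1: each chunk is an independent work unit (work unit size 1),
    // so an idle pool thread simply picks up the next unsorted chunk.
    foreach (chunk; taskPool.parallel(data.chunks(chunkSize), 1))
        sort(chunk);

    // Phase 2: stitch the sorted runs together.
    T[][] runs;
    foreach (chunk; data.chunks(chunkSize))
        runs ~= chunk;

    while (runs.length > 1)
    {
        T[][] next;
        for (size_t i = 0; i + 1 < runs.length; i += 2)
            next ~= mergeSorted(runs[i], runs[i + 1]);
        if (runs.length % 2 == 1)
            next ~= runs[$ - 1]; // odd run out carries over to the next round
        runs = next;
    }
    return runs.length ? runs[0] : data;
}

void main()
{
    import std.random : Random, uniform;

    auto rng = Random(1);
    auto data = new int[](10_007);
    foreach (ref x; data)
        x = uniform(0, 100_000, rng);

    auto sorted = chunkedSort(data, 1_000);
    assert(sorted.length == data.length);
    assert(sorted.isSorted);
}
```

Because the chunk count is independent of the thread count, the same code runs unchanged on any pool size; only the chunk size needs tuning.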

[1] http://en.wikipedia.org/wiki/External_sorting
