c++ - 使用比较器网络对固定长度数组进行非常快速的排序

Question

我有一些性能关键代码，涉及在 C++ 中对一个非常短的固定长度数组进行排序，其中包含大约 3 到 10 个元素（参数在编译时更改）。

我突然想到，专门针对每个可能的输入大小的静态排序网络可能是一种非常有效的方法：我们进行所有必要的比较以确定我们处于哪种情况，然后进行最佳交换次数以进行排序数组。

为了应用这一点，我们使用了一些模板魔法来推断数组长度并应用正确的网络：

#include <iostream>
using namespace std;

template< int K >
void static_sort(const double(&array)[K])
{
    cout << "General static sort\n" << endl;
}

template<>
void static_sort<3>(const double(&array)[3])
{
    cout << "Static sort for K=3" << endl;
}


int main()
{

    double  array[3];

    // performance critical code.
    // ...
    static_sort(array);
    // ...

}

显然，编写所有这些代码很麻烦，所以：

有人对这是否值得努力有任何意见吗？
有谁知道这种优化是否存在于任何标准实现中，例如 std::sort ？
是否有一个简单的地方可以获取实现这种排序网络的代码？
也许可以使用模板魔术静态生成这样的排序网络。

现在我只使用带有静态模板参数的插入排序（如上），希望它会鼓励展开和其他编译时优化。

欢迎您的想法。

更新： 我编写了一些测试代码来比较“静态”插入短和 std::sort。（当我说静态时，我的意思是数组大小是固定的，并在编译时推导出来（可能允许循环展开等）。我得到至少 20% 的 NET 改进（请注意，生成包含在时间中）。平台：铿锵声，OS X 10.9。

代码在这里https://github.com/rosshemsley/static_sorting如果您想将它与您的 stdlib 实现进行比较。

我还没有为比较器网络分类器找到一套不错的实现。

score 18 · Accepted Answer

这是一个使用 Bose-Nelson 算法在编译时生成排序网络的小类。

/**
 * A Functor class to create a sort for fixed sized arrays/containers with a
 * compile time generated Bose-Nelson sorting network.
 * \tparam NumElements  The number of elements in the array or container to sort.
 * \tparam T            The element type.
 * \tparam Compare      A comparator functor class that returns true if lhs < rhs.
 */
template <unsigned NumElements, class Compare = void> class StaticSort
{
    template <class A, class C> struct Swap
    {
        template <class T> inline void s(T &v0, T &v1)
        {
            T t = Compare()(v0, v1) ? v0 : v1; // Min
            v1 = Compare()(v0, v1) ? v1 : v0; // Max
            v0 = t;
        }

        inline Swap(A &a, const int &i0, const int &i1) { s(a[i0], a[i1]); }
    };

    template <class A> struct Swap <A, void>
    {
        template <class T> inline void s(T &v0, T &v1)
        {
            // Explicitly code out the Min and Max to nudge the compiler
            // to generate branchless code.
            T t = v0 < v1 ? v0 : v1; // Min
            v1 = v0 < v1 ? v1 : v0; // Max
            v0 = t;
        }

        inline Swap(A &a, const int &i0, const int &i1) { s(a[i0], a[i1]); }
    };

    template <class A, class C, int I, int J, int X, int Y> struct PB
    {
        inline PB(A &a)
        {
            enum { L = X >> 1, M = (X & 1 ? Y : Y + 1) >> 1, IAddL = I + L, XSubL = X - L };
            PB<A, C, I, J, L, M> p0(a);
            PB<A, C, IAddL, J + M, XSubL, Y - M> p1(a);
            PB<A, C, IAddL, J, XSubL, M> p2(a);
        }
    };

    template <class A, class C, int I, int J> struct PB <A, C, I, J, 1, 1>
    {
        inline PB(A &a) { Swap<A, C> s(a, I - 1, J - 1); }
    };

    template <class A, class C, int I, int J> struct PB <A, C, I, J, 1, 2>
    {
        inline PB(A &a) { Swap<A, C> s0(a, I - 1, J); Swap<A, C> s1(a, I - 1, J - 1); }
    };

    template <class A, class C, int I, int J> struct PB <A, C, I, J, 2, 1>
    {
        inline PB(A &a) { Swap<A, C> s0(a, I - 1, J - 1); Swap<A, C> s1(a, I, J - 1); }
    };

    template <class A, class C, int I, int M, bool Stop = false> struct PS
    {
        inline PS(A &a)
        {
            enum { L = M >> 1, IAddL = I + L, MSubL = M - L};
            PS<A, C, I, L, (L <= 1)> ps0(a);
            PS<A, C, IAddL, MSubL, (MSubL <= 1)> ps1(a);
            PB<A, C, I, IAddL, L, MSubL> pb(a);
        }
    };

    template <class A, class C, int I, int M> struct PS <A, C, I, M, true>
    {
        inline PS(A &a) {}
    };

public:
    /**
     * Sorts the array/container arr.
     * \param  arr  The array/container to be sorted.
     */
    template <class Container> inline void operator() (Container &arr) const
    {
        PS<Container, Compare, 1, NumElements, (NumElements <= 1)> ps(arr);
    };

    /**
     * Sorts the array arr.
     * \param  arr  The array to be sorted.
     */
    template <class T> inline void operator() (T *arr) const
    {
        PS<T*, Compare, 1, NumElements, (NumElements <= 1)> ps(arr);
    };
};

#include <iostream>
#include <vector>

int main(int argc, const char * argv[])
{
    enum { NumValues = 32 };

    // Arrays
    {
        int rands[NumValues];
        for (int i = 0; i < NumValues; ++i) rands[i] = rand() % 100;
        std::cout << "Before Sort: \t";
        for (int i = 0; i < NumValues; ++i) std::cout << rands[i] << " ";
        std::cout << "\n";
        StaticSort<NumValues> staticSort;
        staticSort(rands);
        std::cout << "After Sort: \t";
        for (int i = 0; i < NumValues; ++i) std::cout << rands[i] << " ";
        std::cout << "\n";
    }

    std::cout << "\n";

    // STL Vector
    {
        std::vector<int> rands(NumValues);
        for (int i = 0; i < NumValues; ++i) rands[i] = rand() % 100;
        std::cout << "Before Sort: \t";
        for (int i = 0; i < NumValues; ++i) std::cout << rands[i] << " ";
        std::cout << "\n";
        StaticSort<NumValues> staticSort;
        staticSort(rands);
        std::cout << "After Sort: \t";
        for (int i = 0; i < NumValues; ++i) std::cout << rands[i] << " ";
        std::cout << "\n";
    }

    return 0;
}

基准

以下基准测试是使用 clang -O3 编译的，并在我 2012 年中期的 macbook air 上运行。

对 100 万个数组进行排序的时间（以毫秒为单位）。
大小为 2、4、8 的数组的毫秒数分别为 1.943、8.655、20.246。

以下是 6 个元素的小数组的平均时钟。基准代码和示例可以在这个问题中找到：
Fastest sort of fixed length 6 int array

Direct call to qsort library function   : 342.26
Naive implementation (insertion sort)   : 136.76
Insertion Sort (Daniel Stutzbach)       : 101.37
Insertion Sort Unrolled                 : 110.27
Rank Order                              : 90.88
Rank Order with registers               : 90.29
Sorting Networks (Daniel Stutzbach)     : 93.66
Sorting Networks (Paul R)               : 31.54
Sorting Networks 12 with Fast Swap      : 32.06
Sorting Networks 12 reordered Swap      : 29.74
Reordered Sorting Network w/ fast swap  : 25.28
Templated Sorting Network (this class)  : 25.01

对于 6 个元素，它的执行速度与问题中最快的示例一样快。

可在此处找到用于基准测试的代码。

它包括更多功能和进一步优化，可在真实数据上实现更强大的性能。

score 10 · Accepted Answer

其他答案很有趣而且相当不错，但我相信我可以提供一些额外的答案元素，每点：

值得付出努力吗？好吧，如果您需要对小的整数集合进行排序，并且排序网络被调整为尽可能利用一些指令，那么这可能是值得的。int下图显示了使用不同排序算法对大小为 0-14 的一百万个数组进行排序的结果。如您所见，如果您确实需要，排序网络可以提供显着的加速。

std::sort我知道使用排序网络没有标准实现；当它们没有被微调时，它们可能比直接插入排序慢。libc++std::sort有专门的算法来一次对 0 到 5 个值进行排序，但它们也不使用排序网络。我所知道的唯一使用排序网络对一些值进行排序的排序算法是Wikisort。也就是说，研究论文Applying Sorting Networks to Synthesize Optimized Sorting Libraries表明，排序网络可用于对小型数组进行排序或改进递归排序算法，例如快速排序，但前提是它们经过微调以利用特定的硬件指令.

访问对齐排序算法是某种自下而上的归并排序，显然使用双调排序网络，通过 SIMD 指令实现第一遍。显然，对于某些标量类型，该算法可能比标准库更快。
实际上，我可以提供这样的信息，原因很简单，因为我开发了一个 C++14 排序库，它恰好提供了大小为 0 到 32 的高效排序网络，实现了上一节中描述的优化。我用它来生成第一部分中的图表。我仍在研究库的排序网络部分，以提供大小最优、深度最优和交换最优网络。小的最佳分拣网络是通过蛮力找到的，而更大的分拣网络则使用文献的结果。

请注意，库中的排序算法都没有直接使用排序网络，但是您可以调整它们，以便在排序算法被赋予一个小的std::array或小的固定大小的 C 数组时选择一个排序网络：
```
using namespace cppsort;

// Sorters are function objects that can be
// adapted with sorter adapters from the
// library
using sorter = small_array_adapter<
    std_sorter,
    sorting_network_sorter
>;

// Now you can use it as a function
sorter sort;

// Instead of a size-agnostic sorting algorithm,
// sort will use an optimal sorting network for
// 5 inputs since the bound of the array can be
// deduced at compile time
int arr[] = { 2, 4, 7, 9, 3 };
sort(arr);
```
As mentioned above, the library provides efficient sorting networks for built-in integers, but you're probably out of luck if you need to sort small arrays of something else (e.g. my latest benchmarks show that they are not better than a straight insertion sort even for long long int).
You could probably use template metaprogramming to generate sorting networks of any size, but no known algorithm can generate the best sorting networks, so you might as well write the best ones by hand. I don't think the ones generated by simple algorithms can actually provide usable and efficient networks anyway (Batcher's odd-even sort and pairwise sorting networks might be the only usable ones) [Another answer seems to show that generated networks could actually work].

score 8 · Accepted Answer

对于 N<16，有已知的最佳或至少最佳长度比较器网络，因此至少有一个相当好的起点。公平地说，因为最佳网络不一定是为使用 SSE 或其他矢量算法可实现的最大并行度而设计的。

另一点是，一些 N 的一些最优网络已经是 N+1 的稍大最优网络的退化版本。

来自维基百科：

最多 10 个输入的最佳深度是已知的，它们分别为 0、1、3、3、5、5、6、6、7、7。

这就是说，我会追求实现 N={4、6、8 和 10} 的网络，因为深度约束不能通过额外的并行性来模拟（我认为）。我还认为，与“众所周知”的排序方法（例如快速排序）相比，在 SSE 的寄存器中工作（也使用一些最小/最大指令）甚至是 RISC 架构中一些相对较大的寄存器集的能力将提供明显的性能优势，因为没有指针算术和其他开销。

此外，我会追求使用臭名昭著的循环展开技巧Duff 的设备来实现并行网络。

编辑当已知输入值是正的 IEEE-754 浮点数或双精度数时，还值得一提的是，比较也可以作为整数执行。（float 和 int 必须具有相同的字节序）

score 3 · Accepted Answer

让我分享一些想法。

有人对这是否值得努力有任何意见吗？

不可能给出正确的答案。您必须分析您的实际代码才能找出答案。在我的实践中，当涉及到低级分析时，瓶颈总是不在我想的地方。

有谁知道这种优化是否存在于任何标准实现中，例如 std::sort ？

例如，Visual C++ 实现std::sort对小向量使用插入排序。我不知道使用最佳排序网络的实现。

也许可以使用模板魔术静态生成这样的排序网络

有生成排序网络的算法，例如 Bose-Nelson、Hibbard 和 Batcher 的算法。由于 C++ 模板是图灵完备的，因此您可以使用 TMP 实现它们。但是，这些算法不能保证给出理论上最少数量的比较器，因此您可能需要对最优网络进行硬编码。

c++ - 使用比较器网络对固定长度数组进行非常快速的排序

4 回答 4

Related

Reference