string - 高效的字符串排序算法

Question

通过比较对字符串进行排序（例如标准 QuickSort + strcmp 类函数）可能有点慢，特别是对于共享公共前缀的长字符串（比较函数需要 O(s) 时间，其中 s 是字符串的长度），因此标准解决方案的复杂度为 O(s * nlog n)。有没有已知的更快的算法？

score 17 · Accepted Answer

如果您知道字符串仅包含某些字符（几乎总是如此），您可以使用BucketSort或RadixSort的变体。

score 11 · Accepted Answer

11

你可以建立一个trieO(s*n)，我相信它应该是。

于 2011-08-07T12:05:12.380 回答

score 2 · Accepted Answer

请搜索“Sedgewick Multikey quick sort”（Sedgewick 用 C 和 Java 编写了著名的算法教科书）。他的算法相对容易实现并且速度相当快。它避免了您在上面谈论的问题。有声称更快的突发排序算法，但我不知道任何实现。

有一篇文章Fast String Sort in C# and F# 描述了该算法并引用了 Sedgewick 的代码以及 C# 代码。（披露：这是我根据 Sedgewick 的论文编写的文章和代码）。

score 0 · Accepted Answer

在我看来，三向基数/字符串快速排序是最快的字符串排序算法之一。此外，MSD 基数排序是一种很好的方法。它们在 Sedgewick 的优秀算法书中进行了介绍。

以下是对从此处获取的 leipzig1M.txt进行排序的一些结果：

$ wc leipzig1M.txt
#  lines       words       characters
  1'000'000  21'191'455   129'644'797    leipzig1M.txt

方法	时间
霍尔	7.8792s
Quick3Way	7.5074s
Fast3Way	5.78015s
基数排序	4.86149s
Quick3String	4.3685s
堆排序	32.8318s
合并排序	16.94 秒
std::sort/ introsort	6.10666s
默沙东+Q3S	3.74214s

三路基数/字符串快速排序的迷人之处在于它的实现非常简单，实际上只有大约十行源代码。

template<typename RandomIt>
void insertion_sort(RandomIt first, RandomIt last, size_t d)
{
    const int len = last - first;
    for (int i = 1; i < len; ++i) {
        // insert a[i] into the sorted sequence a[0..i-1]
        for (int j = i; j > 0 && std::strcmp(&(*(first+j))[d], &(*(first+j-1))[d]) < 0; --j)
            iter_swap(first + j, first + j - 1);
    }
}

template<typename RandomIt>
void quick3string(RandomIt first, RandomIt last, size_t d)
{
    if (last - first < 2) return;
#if 0 // seems not to help much
    if (last - first <= 8) { // change the threshold as you like
        insertion_sort(first, last, d);
        return;
    }
#endif
    typedef typename std::iterator_traits<RandomIt>::value_type String;
    typedef typename string_traits<String>::value_type CharT;
    typedef std::make_unsigned_t<CharT> UCharT;

    RandomIt lt = first, i = first + 1, gt = last - 1;
    /* make lo = median of {lo, mid, hi} */
    RandomIt mid = lt + ((gt - lt) >> 1);
    if ((*mid)[d] < (*lt)[d]) iter_swap(lt, mid);
    if ((*mid)[d] < (*gt)[d]) iter_swap(gt, mid);
    // now mid is the largest of the three, then make lo the median
    if ((*lt)[d] < (*gt)[d]) iter_swap(lt, gt);

    UCharT pivot = (*first)[d];
    while (i <= gt) {
        int diff = (UCharT) (*i)[d] - pivot;
        if      (diff < 0) iter_swap(lt++, i++);
        else if (diff > 0) iter_swap(i, gt--);
        else               ++i;
    }
    // Now a[lo..lt-1] < pivot = a[lt..gt] < a[gt+1..hi].
    quick3string(first, lt, d);      // sort a[lo..lt-1]
    if (pivot != '\0')
        quick3string(lt, gt+1, d+1); // sort a[lt..gt] on following character
    quick3string(gt+1, last, d);     // sort a[gt+1..hi]
}

/*
 * Three-way string quicksort.
 * Similar to MSD radix sort, we first sort the array on the leading character
 * (using quicksort), then apply this method recursively on the subarrays. On
 * first sorting, a pivot v is chosen, then partition it in 3 parts, strings
 * whose first character are less than v, equal to v, and greater than v. Just
 * like the partitioning in classic quicksort but with comparing only the 1st
 * character instead of the whole string. After partitioning, only the middle
 * (equal-to-v) part can sort on the following character (index of d+1). The
 * other two recursively sort on the same depth (index of d) because these two
 * haven't been sorted on the dth character (just partitioned them: <v or >v).
 *
 * Time complexity: O(N~N*lgN), space complexity: O(lgN).
 * Explaination: N * string length (for partitioning, find equal-to-v part) +
 *               O(N*lgN) (to do the quicksort thing)
 * character comparisons (instead of string comparisons in normal quicksort).
 */
template<typename RandomIt>
void str_qsort(RandomIt first, RandomIt last)
{
    quick3string(first, last, 0);
}

注意：但是如果你喜欢我在 Google 上搜索“最快的字符串排序算法”，那么它很可能是burstsort，一种缓存感知的 MSD 基数排序变体（论文）。我还发现Bentley 和 Sedgewick 的这篇论文很有帮助，它使用了 Multikey Quicksort。

任何人都知道一些更有趣的字符串排序算法，请帮我编辑。谢谢：）

string - 高效的字符串排序算法

4 回答 4

Related

Reference