通过比较对字符串进行排序(例如标准 QuickSort + strcmp 类函数)可能有点慢,特别是对于共享公共前缀的长字符串(比较函数需要 O(s) 时间,其中 s 是字符串的长度),因此标准解决方案的复杂度为 O(s * nlog n)。有没有已知的更快的算法?
4 回答
如果您知道字符串仅包含某些字符(几乎总是如此),您可以使用BucketSort或RadixSort的变体。
你可以建立一个trieO(s*n)
,我相信它应该是。
请搜索“Sedgewick Multikey quick sort”(Sedgewick 用 C 和 Java 编写了著名的算法教科书)。他的算法相对容易实现并且速度相当快。它避免了您在上面谈论的问题。有声称更快的突发排序算法,但我不知道任何实现。
有一篇文章Fast String Sort in C# and F# 描述了该算法并引用了 Sedgewick 的代码以及 C# 代码。(披露:这是我根据 Sedgewick 的论文编写的文章和代码)。
在我看来,三向基数/字符串快速排序是最快的字符串排序算法之一。此外,MSD 基数排序是一种很好的方法。它们在 Sedgewick 的优秀算法书中进行了介绍。
以下是对从此处获取的leipzig1M.txt进行排序的一些结果:
$ wc leipzig1M.txt
# lines words characters
1'000'000 21'191'455 129'644'797 leipzig1M.txt
方法 | 时间 |
---|---|
霍尔 | 7.8792s |
Quick3Way | 7.5074s |
Fast3Way | 5.78015s |
基数排序 | 4.86149s |
Quick3String | 4.3685s |
堆排序 | 32.8318s |
合并排序 | 16.94 秒 |
std::sort/ introsort | 6.10666s |
默沙东+Q3S | 3.74214s |
三路基数/字符串快速排序的迷人之处在于它的实现非常简单,实际上只有大约十行源代码。
template<typename RandomIt>
void insertion_sort(RandomIt first, RandomIt last, size_t d)
{
const int len = last - first;
for (int i = 1; i < len; ++i) {
// insert a[i] into the sorted sequence a[0..i-1]
for (int j = i; j > 0 && std::strcmp(&(*(first+j))[d], &(*(first+j-1))[d]) < 0; --j)
iter_swap(first + j, first + j - 1);
}
}
template<typename RandomIt>
void quick3string(RandomIt first, RandomIt last, size_t d)
{
if (last - first < 2) return;
#if 0 // seems not to help much
if (last - first <= 8) { // change the threshold as you like
insertion_sort(first, last, d);
return;
}
#endif
typedef typename std::iterator_traits<RandomIt>::value_type String;
typedef typename string_traits<String>::value_type CharT;
typedef std::make_unsigned_t<CharT> UCharT;
RandomIt lt = first, i = first + 1, gt = last - 1;
/* make lo = median of {lo, mid, hi} */
RandomIt mid = lt + ((gt - lt) >> 1);
if ((*mid)[d] < (*lt)[d]) iter_swap(lt, mid);
if ((*mid)[d] < (*gt)[d]) iter_swap(gt, mid);
// now mid is the largest of the three, then make lo the median
if ((*lt)[d] < (*gt)[d]) iter_swap(lt, gt);
UCharT pivot = (*first)[d];
while (i <= gt) {
int diff = (UCharT) (*i)[d] - pivot;
if (diff < 0) iter_swap(lt++, i++);
else if (diff > 0) iter_swap(i, gt--);
else ++i;
}
// Now a[lo..lt-1] < pivot = a[lt..gt] < a[gt+1..hi].
quick3string(first, lt, d); // sort a[lo..lt-1]
if (pivot != '\0')
quick3string(lt, gt+1, d+1); // sort a[lt..gt] on following character
quick3string(gt+1, last, d); // sort a[gt+1..hi]
}
/*
* Three-way string quicksort.
* Similar to MSD radix sort, we first sort the array on the leading character
* (using quicksort), then apply this method recursively on the subarrays. On
* first sorting, a pivot v is chosen, then partition it in 3 parts, strings
* whose first character are less than v, equal to v, and greater than v. Just
* like the partitioning in classic quicksort but with comparing only the 1st
* character instead of the whole string. After partitioning, only the middle
* (equal-to-v) part can sort on the following character (index of d+1). The
* other two recursively sort on the same depth (index of d) because these two
* haven't been sorted on the dth character (just partitioned them: <v or >v).
*
* Time complexity: O(N~N*lgN), space complexity: O(lgN).
* Explaination: N * string length (for partitioning, find equal-to-v part) +
* O(N*lgN) (to do the quicksort thing)
* character comparisons (instead of string comparisons in normal quicksort).
*/
template<typename RandomIt>
void str_qsort(RandomIt first, RandomIt last)
{
quick3string(first, last, 0);
}
注意:但是如果你喜欢我在 Google 上搜索“最快的字符串排序算法”,那么它很可能是burstsort,一种缓存感知的 MSD 基数排序变体(论文)。我还发现Bentley 和 Sedgewick 的这篇论文很有帮助,它使用了 Multikey Quicksort。
任何人都知道一些更有趣的字符串排序算法,请帮我编辑。谢谢 :)