algorithm - Timsort 中运行的大小

Question

我正在从这个链接 hacknoon Timsort学习Timsort

有一种说法“当运行次数等于或略小于 2 的幂时，合并 2 个数组效率更高。Timsort 选择 minrun 来尝试确保这种效率，方法是确保 minrun 等于或小于2的幂。”

为什么“当运行次数等于或略小于 2 的幂时，合并 2 个数组效率更高”？

score 0 · Accepted Answer

它用于一般/随机情况下的平衡合并（如果给定的数据不是随机的但具有较长的自然运行，那么minrun它的选择可能并不重要，并且平衡更多地取决于运行的长度而不是运行次数）。

在一般/随机情况下，您希望生成精确minrun元素的运行。然后，当运行次数是 2 的幂时，您将获得完全平衡的合并。如果运行次数略大于2 的幂，则最终会出现非常不平衡的合并。如果运行次数略小于2 的幂，则只会出现一点不平衡。

同样，这（至少大部分）适用于一般/随机情况。例如，如果您有9 个长度为 800、100、100、100、100、100、100、100、100 个元素的自然游程，那么尽管游程数略大于2.

摘自蒂姆自己listsort.txt关于此选择的描述minrun：

Because sortperf.py only tries powers of 2, it took a long time to notice
that 32 isn't a good choice for the general case!  Consider N=2112:

>>> divmod(2112, 32)
(66, 0)
>>>

If the data is randomly ordered, we're very likely to end up with 66 runs
each of length 32.  The first 64 of these trigger a sequence of perfectly
balanced merges (see next section), leaving runs of lengths 2048 and 64 to
merge at the end.  The adaptive gimmicks can do that with fewer than 2048+64
compares, but it's still more compares than necessary, and-- mergesort's
bugaboo relative to samplesort --a lot more data movement (O(N) copies just
to get 64 elements into place).

If we take minrun=33 in this case, then we're very likely to end up with 64
runs each of length 33, and then all merges are perfectly balanced.  Better!

algorithm - Timsort 中运行的大小

1 回答 1

Related

Reference