4

I'd like to sort a list and observe/visualize how Python's sorting algorithm Timsort is moving the elements around.

First attempt: A subclass of list which prints itself after each change:

class List(list):
    def __setitem__(self, index, value):
        list.__setitem__(self, index, value)
        print(self)

That works when I change elements myself, but during sort ... nothing:

>>> a = List([None] * 2)
>>> a[0] = 'S'
['S', None]
>>> a[1] = 'P'
['S', 'P']
>>> a.sort()
>>>

Second attempt: Print the list (in global variable a) at each comparison of elements:

class Str(str):
    def __lt__(self, other):
        print(a)
        return other > self

That does something, but the list is always... empty?

>>> a = list(map(Str, 'test'))
>>> a.sort()
[]
[]
[]
[]
[]
[]
>>>

Why do these attempts fail, and is there a way to observe what Timsort is doing?

4

2 回答 2

10

Yes, it can be observed. I did, here's a visualization:

Timsort visualization

And another one with more elements:

Timsort visualization

Why do your attempts fail?

  • The list subclass fails because sort works "inside" the list at C code level, not going through the public Python __setitem__ interface.
  • The comparison attempt fails because the list is indeed made empty during the sort, as the source code explains:
     /* The list is temporarily made empty, so that mutations performed
     * by comparison functions can't affect the slice of memory we're
     * sorting (allowing mutations during sorting is a core-dump
     * factory, since ob_item may change).
    

But then how can we observe the list during the sorting, if it's empty? Easy: We don't. Not during the sorting. But after partial sortings.

Timsort is:

  • Deterministic, so sorting a certain input always results in the same steps.
  • Nice enough to not destroy the list if an exception occurs. Instead it just "stops where it is" and leaves the list in a not fully sorted state. As the code says:
    /* [...]  Even in case of error, the
     * list will be some permutation of its input state (nothing is lost or
     * duplicated).
    

So the idea is:

  • Sort, and raise an exception at the first comparison so that Timsort stops there. Catch the exception and print the partially sorted list.
  • Sort again, from the same initial state, and raise an exception at the second comparison so that Timsort stops there. Catch the exception and print the partially sorted list.
  • Again, but with an exception at the third comparison.
  • And so on, allowing more and more comparisons, until the list is fully sorted.

Code doing that (except it doesn't show duplicate states, which happens when Timsort makes several comparisons before the next move):

import string
from random import shuffle
from itertools import count

class Str(str):
    def __lt__(self, other):
        global allowed_comparisons
        if allowed_comparisons == 0:
            raise Exception
        allowed_comparisons -= 1
        return other > self

a = list(string.ascii_letters + string.digits + '<>')
shuffle(a)

shown = None
for allowed_comparisons in count():
    try:
        b = list(map(Str, a))
        b.sort()
    except:
        pass
    state = ''.join(b)
    if state != shown:
        print(state)
        shown = state
    if list(state) == sorted(state):
        break

Output, discussed below:

k1z5Qi>jCVsfRbBWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
1kz5Qi>jCVsfRbBWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15kzQi>jCVsfRbBWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15Qkzi>jCVsfRbBWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15Qikz>jCVsfRbBWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>QikzjCVsfRbBWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>QijkzCVsfRbBWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>CQijkzVsfRbBWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>CQVijkzsfRbBWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>CQVijkszfRbBWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>CQVfijkszRbBWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>CQRVfijkszbBWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>CQRVbfijkszBWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>BCQRVbfijkszWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>BCQRVWbfijkszSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>BCQRSVWbfijksznJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>BCQRSVWbfijknszJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>BCJQRSVWbfijknszOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>BCJOQRSVWbfijknszD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>BCDJOQRSVWbfijknsz6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
156>BCDJOQRSVWbfijknszAZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
156>ABCDJOQRSVWbfijknszZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
156>ABCDJOQRSVWZbfijknsz3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDJOQRSVWZbfijknszcGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDJOQRSVWZbcfijknszGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGJOQRSVWZbcfijknszTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGJOQRSTVWZbcfijknszaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGJOQRSTVWZabcfijknszIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknszrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrszvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvzw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz0wpePH94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz0pwePH94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz0epwPH94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz0PepwH94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz0HPepw94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz09HPepw4UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz049HPepwUgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz049HPUepwgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz049HPUegpwqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz049HPUegpqwEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz049EHPUegpqwK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz049EHKPUegpqw2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz0249EHKPUegpqwFNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz0249EFHKPUegpqwNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz0249EFHKNPUegpqwYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz0249EFHKNPUYegpqwL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz0249EFHKLNPUYegpqw7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz02479EFHKLNPUYegpqwXdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz02479EFHKLNPUXYegpqwdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz02479EFHKLNPUXYdegpqwltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz02479EFHKLNPUXYdeglpqwtoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz02479EFHKLNPUXYdeglpqtwoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz02479EFHKLNPUXYdeglopqtwxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz024789EFHKLNPUXYdeglopqtwxyuM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz024789EFHKLNPUXYdeglopqtuwxyM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz024789EFHKLMNPUXYdeglopqtuwxy<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz024789<EFHKLMNPUXYdeglopqtuwxyhm
1356>ABCDGIJOQRSTVWZabcfijknrsvz024789<EFHKLMNPUXYdeghlopqtuwxym
1356>ABCDGIJOQRSTVWZabcfijknrsvz024789<EFHKLMNPUXYdeghlmopqtuwxy
01356>ABCDGIJOQRSTVWZabcfijknrsvz24789<EFHKLMNPUXYdeghlmopqtuwxy
012356>ABCDGIJOQRSTVWZabcfijknrsvz4789<EFHKLMNPUXYdeghlmopqtuwxy
0123456>ABCDGIJOQRSTVWZabcfijknrsvz789<EFHKLMNPUXYdeghlmopqtuwxy
01234567>ABCDGIJOQRSTVWZabcfijknrsvz89<EFHKLMNPUXYdeghlmopqtuwxy
012345678>ABCDGIJOQRSTVWZabcfijknrsvz9<EFHKLMNPUXYdeghlmopqtuwxy
0123456789>ABCDGIJOQRSTVWZabcfijknrsvz<EFHKLMNPUXYdeghlmopqtuwxy
0123456789<>ABCDGIJOQRSTVWZabcfijknrsvzEFHKLMNPUXYdeghlmopqtuwxy
0123456789<>ABCDEGIJOQRSTVWZabcfijknrsvzFHKLMNPUXYdeghlmopqtuwxy
0123456789<>ABCDEFGIJOQRSTVWZabcfijknrsvzHKLMNPUXYdeghlmopqtuwxy
0123456789<>ABCDEFGHIJOQRSTVWZabcfijknrsvzKLMNPUXYdeghlmopqtuwxy
0123456789<>ABCDEFGHIJKOQRSTVWZabcfijknrsvzLMNPUXYdeghlmopqtuwxy
0123456789<>ABCDEFGHIJKLOQRSTVWZabcfijknrsvzMNPUXYdeghlmopqtuwxy
0123456789<>ABCDEFGHIJKLMOQRSTVWZabcfijknrsvzNPUXYdeghlmopqtuwxy
0123456789<>ABCDEFGHIJKLMNOQRSTVWZabcfijknrsvzPUXYdeghlmopqtuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTVWZabcfijknrsvzUXYdeghlmopqtuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWZabcfijknrsvzXYdeghlmopqtuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXZabcfijknrsvzYdeghlmopqtuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXYZabcfijknrsvzdeghlmopqtuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdfijknrsvzeghlmopqtuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefijknrsvzghlmopqtuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefgijknrsvzhlmopqtuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijknrsvzlmopqtuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklnrsvzmopqtuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnrsvzopqtuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnorsvzpqtuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnoprsvzqtuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrsvztuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstvzuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvzwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

In the above output, note there are three stages:

  1. The first half (32 elements) gets sorted until it's 1356>ABCDGIJOQRSTVWZabcfijknrsvz. Timsort does that with binary insertion sort. Each line in the output corresponds to one insertion.
  2. The second half (32 elements) gets sorted until it's 024789<EFHKLMNPUXYdeghlmopqtuwxy. Again with binary insertion sort.
  3. Timsort merges the two halves. These lines in the output don't quite show real states during the sort. Let's look at the first step:
    1356>ABCDGIJOQRSTVWZabcfijknrsvz024789<EFHKLMNPUXYdeghlmopqtuwxy
    01356>ABCDGIJOQRSTVWZabcfijknrsvz24789<EFHKLMNPUXYdeghlmopqtuwxy
    The way Timsort really merges the two halves is to move the first half out, to temporary space. And then merge the two halves into the given list from left to right. So really after the 0 from the second half gets moved to the front, the list looks like this:
    0--------------------------------24789<EFHKLMNPUXYdeghlmopqtuwxy
    With all the dashes being unoccupied space. Now when I raise my exception, Timsort doesn't just leave the list like that, but nicely at least moves the first halve's remaining elements back into that unoccupied space. And that's why in my output, it looks like Timsort is moving the 0 to the left and shifting the entire first half to the right by one index. That would be inefficient, and it's not what happens when Timsort runs normally, without my exceptions.

You can also see these three stages in my animated image at the top. And in this "screensaver" version, where I scroll down through the above output. I think it looks cool (was hoping it would look somewhat like Matrix code rain) but I find it less clear:

Timsort visualization

Note that here the third stage merges from right to left (with the right half moved out to temp space), which Timsort does when that's better.

Code for generating that image using Pillow, after storing the states in list states instead of printing them:

from PIL import Image, ImageDraw

images = []
for i in range(len(states) - 14):
    img = Image.new('RGB', (384, 225), (0, 0, 0))
    text = '\n'.join(states[i:i+15])
    d = ImageDraw.Draw(img)
    d.multiline_text((0,0), text, fill=(0, 200, 0))
    img = img.resize((384*2, 225*2), 0)
    images.append(img)
images[0].save('timatrix.png', save_all=True, append_images=images[1:],
               optimize=False, duration=100, loop=0)
于 2020-08-05T22:31:01.617 回答
0

如果您的目标是深入了解算法的操作,那么了解元素的具体值和位置就没有观察堆栈上的运行长度那么重要。这些运行长度在 Stefan Pochmann 动画中出现的锯齿中可见。

这些运行长度揭示了指针(索引)堆栈增长到数据数组的对数限制,这是基本 timsort 算法的一个重要特征。以下是对此进行更清晰表示的示例,仅显示堆栈索引:

  entry [0, 53, 106]
   exit [0, 106]
  entry [0, 106, 159, 212]
   exit [0, 106, 212]
  entry [0, 106, 212]
   exit [0, 212]
  entry [0, 212, 265, 318]
   exit [0, 212, 318]
  entry [0, 212, 318, 371, 424]
   exit [0, 212, 318, 424]
  entry [0, 212, 318, 424]
   exit [0, 212, 424]
  entry [0, 212, 424]
   exit [0, 424]
  entry [0, 424, 477, 530]
   exit [0, 424, 530]
  entry [0, 424, 530, 583, 636]
   exit [0, 424, 530, 636]
  entry [0, 424, 530, 636]
   exit [0, 424, 636]
  entry [0, 424, 636, 689, 742]
   exit [0, 424, 636, 742]
  entry [0, 424, 636, 742, 795, 836]
   exit [0, 424, 636, 742, 836]
  entry [0, 424, 636, 742, 836]
   exit [0, 424, 636, 836]
  entry [0, 424, 636, 836]
   exit [0, 424, 836]
  entry [0, 424, 836]
   exit [0, 836]
Functional trial 0 sorted 836 random data items.

索引之间的数据已按升序排序。该标签entry显示了合并例程入口处的堆栈状态,并exit显示了退出时的状态。当entry等于 previousexit时,由于不变量的重复失败,在继续创建下一个运行之前,正在合并多个运行。从栈顶索引到数组大小的数据是 timsort 在其对数组的线性扫描中尚未检查的数据。当栈顶索引等于数组大小时,timsort 已经完成扫描,但可能需要通过一些合并来完成。在此示例中, timsort 在不变量第一次失败并开始合并之前生成两次运行,并通过合并五个运行来完成。

虽然可以使用 Stefan Pochmann 的代码通过更复杂的事后分析来生成这种类型的输出,但此数据是使用此处找到的 timsort 实现生成的:

https://gist.github.com/ruminations/89a045dc0ef7edfb92304a0de0752ee0

通过取消注释方法中的两个##print ...语句merge(行263和)并在最后的测试套件中的行后303添加一个break语句。391该代码对于玩弄算法并获得对其操作的一些直观理解很有用。

于 2020-12-04T20:47:50.823 回答