python - 添加到列表时对日期列表进行排序的最快方法

Question

我正在写一些类似于任务调度程序的东西。我有两组任务，有些是固定的（给定开始和结束日期和时间），有些不是固定的（给定开始日期和时间以及持续时间）。

非固定任务受到固定任务的影响，因此如果非固定任务被固定任务重叠，则非固定任务将其持续时间延长重叠量。

我从一个元组列表开始，其中第一项是开始日期，第二项是该固定任务的 ID，如下所示：

[(2012-04-30, 1), (2012-05-01, 5), (2012-05-04, 2)]

然后，我有另一个由用户订购的非固定任务列表。我的想法是我将遍历这个列表，在那个循环内我将遍历第一个列表以找到可能与此任务重叠的任务，并且可以确定哪些任务可以扩展非固定任务.

这是我请求你帮助的地方。现在我知道了这个非固定任务的计算开始和结束时间，我需要将其视为“固定”，以便它影响其余的非固定任务。

我可以将此任务添加到固定任务的第一个列表中并再次对其进行排序，但这意味着我每次添加任务时都会对列表进行排序。

我可以遍历第一个列表并找到应该插入此任务的点，然后将其插入那里。但是，如果它的位置在列表的前面，则需要花费时间将所有其他项目移到一个位置。如果它的位置在列表的后面，我将不得不遍历很多元素才能到达正确的位置。

所以，我不赞成使用这些选项中的任何一个。这里真正的问题是：在向列表添加内容的同时保持列表排序的最佳方法是什么？或者有没有更好的方法来做这一切？

score 3 · Accepted Answer

Here is the example of using bisect and comparison with using the sort of the partially sorted list. The bisect solution clearly wins:

import bisect
import random
import timeit


def bisect_solution(size=10000):
    lst = []
    for n in xrange(size):
        value = random.randint(1, 1000000)
        bisect.insort_left(lst, value)
    return lst


# Cut out of the bisect module to be used in bisect_solution2()
def insort_left(a, x, lo=0, hi=None):
    """Insert item x in list a, and keep it sorted assuming a is sorted.

    If x is already in a, insert it to the left of the leftmost x.

    Optional args lo (default 0) and hi (default len(a)) bound the
    slice of a to be searched.
    """

    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if a[mid] < x: lo = mid+1
        else: hi = mid
    a.insert(lo, x)


def bisect_solution2(size=10000):
    lst = []
    for n in xrange(size):
        value = random.randint(1, 1000000)
        insort_left(lst, value)
    return lst


def sort_solution(size=10000):
    lst = []
    for n in xrange(size):
        value = random.randint(1, 1000000)
        lst.append(value)
        lst.sort()
    return lst


t = timeit.timeit('bisect_solution()', 'from __main__ import bisect_solution', number = 10)
print "bisect_solution: ", t

t = timeit.timeit('bisect_solution2()', 'from __main__ import bisect_solution2', number = 10)
print "bisect_solution2: ", t

t = timeit.timeit('sort_solution()', 'from __main__ import sort_solution', number = 10)
print "sort_solution: ", t

The bisect_solution2() is almost the same as bisect_solution() -- only with the code copied-out of the module. Someone else should explain why it takes more time :)

The bisect_solution2() is here to be modified for cmp() function to be able to compare the tuples.

It shows the following results on my computer:

bisect_solution:  0.637892403587
bisect_solution2:  0.988893038133
sort_solution:  15.3521410901

Here is a bisect solution adopted for the tuples where date is a string:

import random
import timeit


def random_date_tuple():
    s1 = '{0}-{1:02}-{2:02}'.format(random.randint(2000, 2050),
                                    random.randint(1, 12),
                                    random.randint(1, 31))
    e2 = random.randint(1,50)
    return (s1, e2)


def my_cmp(a, b):
    result = cmp(a[0], b[0])   # comparing the date part of the tuple
    if result == 0:
        return cmp(a[1], b[1]) # comparint the other part of the tuple
    return result


def my_insort_left(a, x, cmp=my_cmp, lo=0, hi=None):
    """The bisect.insort_left() modified for comparison of tuples."""

    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if cmp(a[mid], x) < 0: 
            lo = mid+1
        else: 
            hi = mid
    a.insert(lo, x)


def bisect_solution3(size=1000):
    lst = []
    for n in xrange(size):
        value = random_date_tuple()
        my_insort_left(lst, value)
    return lst


def sort_solution(size=1000):
    lst = []
    for n in xrange(size):
        value = random_date_tuple()
        lst.append(value)
        lst.sort(cmp=my_cmp)
    return lst


t = timeit.timeit('bisect_solution3()', 'from __main__ import bisect_solution3', number = 10)
print "bisect_solution3: ", t

t = timeit.timeit('sort_solution()', 'from __main__ import sort_solution', number = 10)
print "sort_solution: ", t

print bisect_solution3()[:10]

Notice that the list size is 10 times less than in the previous as the sort solution was very slow. It prints:

bisect_solution3:  0.223602245968
sort_solution:  3.69388944301
[('2000-02-01', 20), ('2000-02-13', 48), ('2000-03-11', 25), ('2000-03-13', 43),
 ('2000-03-26', 48), ('2000-05-04', 17), ('2000-06-06', 23), ('2000-06-12', 31),
 ('2000-06-15', 15), ('2000-07-07', 50)]

score 1 · Accepted Answer

我可以遍历第一个列表并找到应该插入此任务的点，然后将其插入那里。但是，如果它的位置在列表的前面，则需要花费时间将所有其他项目移到一个位置。如果它的位置在列表的后面，我将不得不遍历很多元素才能到达正确的位置。

可以使用二分搜索在 O(log n) 中找到将某些内容插入排序列表的正确位置。插入仍然是 O(n)。

还有更复杂的数据结构，如B-Trees，允许在 O(log n) 中插入和搜索。看看这个和这个。

score 1 · Accepted Answer

这里真正的问题是：在向列表添加内容的同时保持列表排序的最佳方法是什么？

插入排序是要走的路。但你可能不喜欢它，因为你已经知道了。接下来你可以做的就是这个，

添加时不要排序。
当您获得项目时对其进行排序并缓存它。下次请求时从以前的缓存中显示。
添加任何新项目时使缓存无效。

我不是 python 程序员，但我可以通过 PHP 类给你一些想法。

class SortedList(){
    public $list = array();
    private $cached_list;

    public function add($item){
        array_push($this->list, $item);
        $this->sorted = false;
    }
    public function get(){
        if($this->sorted==true){
            return $this->cached_list;
        }

        // sort the array;

        // copying the list to cached list and sort it
        $this->cached_list = $this->list;
        sort($this->cached_list);

        // set the flag
        $this->sorted = true;
        return $this->cached_list
    }

}

score 0 · Accepted Answer

堆队列是你的朋友。来自维基百科：

堆通常执行的操作是：

create-heap : 创建一个空堆
find-max：找到最大堆的最大项
delete-max：删除最大堆的根节点
增加键：更新最大堆中的键
insert：向堆中添加一个新键
合并：连接两个堆以形成一个有效的新堆，其中包含两者的所有元素。

有一个内置的Python 堆队列实现。堆针对 1) 删除最大元素，2) 插入新元素以保持堆排序进行了优化。

python - 添加到列表时对日期列表进行排序的最快方法

4 回答 4

Related

Reference