algorithm - 返回给定间隔的插入项目

Question

如何设计一种内存高效的系统，该系统接受添加到其中的项目并允许在给定时间间隔的情况下检索项目（即返回在时间 T1 和时间 T2 之间插入的项目）。不涉及数据库。存储在内存中的项目。涉及的数据结构和相关算法是什么。

更新：与数据查询相比，假设插入率极高。

score 1 · Accepted Answer

您可以使用排序的数据结构，其中键是按到达时间。请注意以下事项：

物品没有被移除
项目按顺序插入[如果项目i是在项目之后插入的，j则key(i)>key(j)]。

出于这个原因，不鼓励 tree ，因为它是“压倒性的”，并且插入它是O(logn)，您可以在其中获得O(1)插入。我建议使用以下方法之一：

(1)数组：数组总是在其末尾被填满。当分配的数组已满时，重新分配一个更大的 [双倍大小] 数组，并将现有数组复制到其中。
优点：在数组中通常期望良好的缓存O(1)，机械化插入，最多使用空间2*elementSize*#elemetns
缺点：高延迟：当数组已满时，需要O(n)添加一个元素，因此您需要期望偶尔会有是昂贵的操作。

(2)跳过列表跳过列表还允许您在末尾进行O(logn)查找和O(1)插入，但它没有延迟问题。但是，它会比数组更受缓存未命中的影响。使用的空间平均elementSize*#elements + pointerSize*#elements*2用于跳过列表。
优点：O(1)插入，没有昂贵的操作。
缺点：预计会出现糟糕的缓存。

建议：
如果延迟不是问题，我建议使用数组。如果是，您最好使用跳过列表。

在两者中，找到所需的间隔是：

findInterval(T1,T2):
  start <- data.find(T1)
  end <- data.find(T2)
  for each element in data from T1 to T2:
     yield element

score 0 · Accepted Answer

We already have an answer that suggests trees, but I think we need to be more specific: the only situation in which this is really a good solution is if you are very specific about how you build up the tree (and then I would say it's on par with the skip lists suggested in a different answer; ). The objective is to keep the tree as full as possible to the left - I'll make clearer what that means in the following. Make sure each node has a pointer to its (up to) two children and to its parent and knows the depth of the subtree rooted at that node.

Keep a pointer to the root node so that you are able to do lookups in O(log(n)), and keep a pointer to the last inserted node N (which is necessarily the node with the highest key - its timestamp will be the highest). When you are inserting a node, check how many children N has:

If 0, then replace N with the new node you are inserting and make N its left child. (At this point you'll need to update the tree depth field of at most O(log(n)) nodes.)
If 1, then add the new node as its right child.
If 2, then things get interesting. Go up the tree from N until either you find a node that has only 1 child, or the root. If you find a node with only 1 child (this is necessarily the left child), then add the new node as its new right child. If all nodes up to the root have two children, then the current tree is full. Add the new node as the new root node and the old root node as its left child. Don't change the old tree structure otherwise.

Addendum: in order to make cache behaviour and memory overhead better, the best solution is probably to make a tree or skip list of arrays. Instead of every node having a single time stamp and a single value, make every node have an array of, say, 1024 time stamps and values. When an array fills up you add a new one in the top level data structure, but in most steps you just add a single element to the end of the "current array". This wouldn't affect big-O behaviour with respect to either memory or time, but it would reduce the overhead by a factor of 1024, while latency is still very small.

score 0 · Accepted Answer

关系区间树怎么样（将您的项目编码为仅包含单个元素的区间，例如 [a,a]）？虽然，已经说过预期操作的比例很重要（实际上很多）。但这是我的两分钱：

我想在时间 t(X) 插入的项目 X 与该时间戳相关联，对吗？这意味着您现在不插入具有一周前时间戳或其他内容的项目。如果是这种情况，请使用简单数组并进行插值搜索或类似的操作（您的项目将根据您的查询引用的属性进行排序，即时间 t(X)）。

score 0 · Accepted Answer

BTree 或二叉搜索树都可能是完成上述任务的良好内存数据结构。只需将时间戳保存在每个节点中，您就可以进行范围查询。

score 0 · Accepted Answer

您可以将它们全部添加到一个简单的数组中并对其进行排序。

进行二分搜索以定位T1和T2。它们之间的所有数组元素都是您要查找的。

如果仅在添加所有元素后完成搜索，这将很有帮助。如果没有，您可以使用AVL或红黑树

algorithm - 返回给定间隔的插入项目

5 回答 5

Related

Reference