algorithm - 30,000 data points, find the biggest change over 2 weeks

Question

I have:

- 30,000 data points
- each data point is a measurement of type float
- each measurement is associated with a date
- each date has only one measurement
- no dates are without measurements
- the data comes in the form of a text file: 30,000 lines in this form:
    - YYYY-MM-DD I,F (e.g. 1977-02-08 20.74)
- measurements appearing in the source file are already sorted by date
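
A minimal sketch of reading this format in Python, assuming each line holds one whitespace-separated `YYYY-MM-DD value` pair. The helper name `parse_measurements` is hypothetical, not something given in the question.

```python
from datetime import date

def parse_measurements(lines):
    """Parse 'YYYY-MM-DD value' lines into parallel lists of dates and floats."""
    dates, values = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        day_text, value_text = line.split()
        dates.append(date.fromisoformat(day_text))  # needs Python 3.7+
        values.append(float(value_text))
    return dates, values

# Two made-up sample lines in the format described above.
sample = ["1977-02-08 20.74", "1977-02-09 21.10"]
dates, values = parse_measurements(sample)
print(dates[0], values)
```

An open file object can be passed in place of the list, since iterating over a file yields its lines.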

I need:

- a time-interval T with boundaries (s,e) /* start, end */
- (e - s = 14 days) the time-interval *must* be 2 weeks
- define min as the lowest value in the interval T
- define max as the greatest value in the interval T
- the chosen T needs to have the greatest distance btwn max and min of all possible Ts
- break ties among intervals T by choosing the most recent (with the greatest s value)
- the chosen T must consider all jumps in the 14 days, not just the values @ s and e
- if the overall "variance" in the interval is great but the jump 
  |max-min| is not the greatest in absolute value, T is not the right choice,
  even if it's an "exciting" interval

I'm asking:

- which algorithm to employ, considering algorithms are not my specialty
- which data structure to use to keep track of the subtotals

Notes:

- an answer in pseudo code would be preferred, "prose" is fine if pressured for time
- an answer in Python would be... splendid :)

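If needed, you can generate "dummy" data and run your proposed algorithm as a test, or I can share the actual data.

I'm not very concerned about performance here, beyond wanting to know the quickest way to learn how to apply the right solution and the right algorithm.

I think I could "prove" correctness with even the simplest iterative algorithm, because the dataset is small given today's computers.

So far I'm "traversing and carrying along 14 vectors of 14 measurements"; if you could teach me how to do this incrementally with sub-sums, that would be really appreciated.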

score 2 · Accepted Answer

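A sliding window does work here, by keeping two stacks (this name is perhaps a little misleading, since each is probably best implemented as a double-ended queue). Keep one stack called minstack and one called maxstack. The key to the algorithm is that, read from bottom to top, minstack should be non-decreasing and maxstack should be non-increasing at every point of the slide. So, how do we do that?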

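First, add the first 14 points to the stacks. Let's define add(point) as:

Do this for the minstack:

- While the point is smaller than the top element of the minstack, remove the top element of the minstack.
- Add the point to the minstack.

Similarly, for the maxstack:

- While the new point is greater than the top element of the maxstack, remove the top element of the maxstack.
- Add the point to the maxstack.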

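Because of the property above, the min and max of the first 14 elements are the bottom elements of minstack and maxstack. Now slide the window. We only need to note that if the left point is still "alive" in either stack, it is necessarily the bottom element now. So this should be easy; it is simply: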

slide():
    add(new_point)
    if (left_point == bottom(minstack)) remove_bottom(minstack)
    if (left_point == bottom(maxstack)) remove_bottom(maxstack)

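Do this until you run out of points. The interval you are looking for is the one where bottom(maxstack) - bottom(minstack) is largest.

Note that any point enters the minstack/maxstack at most once, and each point also leaves each stack at most once, so at most 4 operations are performed per point, no matter how large the desired interval is.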
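
The 4-operations-per-point bound can be checked empirically. The sketch below instruments the two deques and counts every push and pop; the names `count_operations` and `window` are illustrative, and the random data is made up for the check.

```python
import collections
import random

def count_operations(points, window=14):
    """Slide a window over points and count deque pushes and pops."""
    minstack = collections.deque()
    maxstack = collections.deque()
    ops = 0
    for index, x in enumerate(points):
        # Pop dominated elements from the top, then push the new point.
        while minstack and x < minstack[-1]:
            minstack.pop()
            ops += 1
        while maxstack and x > maxstack[-1]:
            maxstack.pop()
            ops += 1
        minstack.append(x)
        maxstack.append(x)
        ops += 2
        # Drop the point that just left the window if it is still at the bottom.
        if index >= window:
            leaving = points[index - window]
            if minstack[0] == leaving:
                minstack.popleft()
                ops += 1
            if maxstack[0] == leaving:
                maxstack.popleft()
                ops += 1
    return ops

random.seed(0)
data = [random.random() for _ in range(30000)]
total = count_operations(data)
print(total <= 4 * len(data))
```

Each point is pushed once onto each deque and removed from each at most once, which is where the factor of 4 comes from.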

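Edit: I just noticed you wanted an implementation in Python. I didn't really want to parse the data, so the function takes a list of values as input and outputs the indices (s,e) into that array: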

import collections

def add(x, minstack, maxstack):
    # Pop from the top every element the new point makes obsolete, then push it.
    while minstack and x < minstack[-1]:
        minstack.pop()
    while maxstack and x > maxstack[-1]:
        maxstack.pop()
    minstack.append(x)
    maxstack.append(x)

def get_largest_interval(points):
    minstack = collections.deque()
    maxstack = collections.deque()

    best_diff = -1
    best_interval = None

    for index, elem in enumerate(points):
        add(elem, minstack, maxstack)
        if index >= 14:
            # The point that just left the window, if still present, is at the bottom.
            if minstack[0] == points[index-14]: minstack.popleft()
            if maxstack[0] == points[index-14]: maxstack.popleft()

        if index >= 13:
            this_diff = maxstack[0] - minstack[0]
            # >= breaks ties in favor of the most recent interval.
            if best_diff == -1 or this_diff >= best_diff:
                best_interval = (index-13, index)
                best_diff = this_diff

    return best_interval


# Python 3 print; prints (0, 13)
print(get_largest_interval([0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3]))
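
Since the question invites testing on dummy data, one way to gain confidence in the deque approach is to compare it with a direct scan on random input. This is only a sketch: `sliding_best` and `brute_best` are illustrative names, both assume a 14-point window, and both resolve ties in favor of the later window.

```python
import collections
import random

WINDOW = 14  # points per window, as in the answer above

def sliding_best(points):
    """Deque-based search; returns (start, end) indices or None."""
    minstack, maxstack = collections.deque(), collections.deque()
    best_diff, best_interval = None, None
    for index, x in enumerate(points):
        while minstack and x < minstack[-1]:
            minstack.pop()
        while maxstack and x > maxstack[-1]:
            maxstack.pop()
        minstack.append(x)
        maxstack.append(x)
        if index >= WINDOW:
            leaving = points[index - WINDOW]
            if minstack[0] == leaving:
                minstack.popleft()
            if maxstack[0] == leaving:
                maxstack.popleft()
        if index >= WINDOW - 1:
            diff = maxstack[0] - minstack[0]
            if best_diff is None or diff >= best_diff:
                best_diff, best_interval = diff, (index - WINDOW + 1, index)
    return best_interval

def brute_best(points):
    """Direct scan of every window; returns (start, end) indices or None."""
    best_diff, best_interval = None, None
    for start in range(len(points) - WINDOW + 1):
        chunk = points[start:start + WINDOW]
        diff = max(chunk) - min(chunk)
        if best_diff is None or diff >= best_diff:
            best_diff, best_interval = diff, (start, start + WINDOW - 1)
    return best_interval

random.seed(1)
for _ in range(200):
    # Small integer values force many duplicates and ties.
    data = [random.randint(0, 5) for _ in range(random.randint(14, 60))]
    assert sliding_best(data) == brute_best(data)
print("all random cases agree")
```

The small value range is deliberate: duplicates and ties are where the equality checks against the leaving point and the tie-breaking rule are most likely to go wrong.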

score 1 · Accepted Answer

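If I understand you, you have:

30,000 distinct, ordered data values. The ordering happens to be by date, but that is not relevant.

Within this set there are 29,987 subsets (30,000 - 14 + 1) whose contents are ordered sequences, each starting at one data point and containing that initial point and the 13 data points that follow it.

Taking it slowly:

1) Read your 30,000 data points into an array of size 30,000.

2) Allocate an array of size 29,987. Call this array "Potential Winners".

3) Fill the Potential Winners array by scanning each 14-point subset, temporarily holding the max and min encountered in the subset. When these two values are in hand, save (Max-Min) in Potential Winners at the index of the subset's starting point. Don't try any sliding-window optimization; see below.

4) Do a linear scan of Potential Winners, saving the best value and (importantly) the index at which it resides.

BTW: what do you do if there is no single winner? If all data points have the same value, you will get 29,987 candidate winners, all with the same value.

5) Optimization: don't allocate and fill Potential Winners; initialize Current Winner, a tuple (value, index), to (0, -1). Compute the value of each 14-point subset as above, but retain only the better of {Current Winner, "the value I get from this current subset"}.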
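
Steps 1 through 5 can be sketched as follows, assuming the values are already in a list. The name `best_window_bruteforce` is illustrative. Comparing with `>=` resolves ties in favor of the most recent window, which is one way to handle the no-single-winner case and matches the tie-breaking rule in the question.

```python
def best_window_bruteforce(values, size=14):
    """Return (spread, start_index) of the window with the largest max - min."""
    current_winner = (0, -1)  # (value, index), as in step 5
    for start in range(len(values) - size + 1):
        window = values[start:start + size]
        spread = max(window) - min(window)
        # >= keeps the later window when two spreads are equal.
        if spread >= current_winner[0]:
            current_winner = (spread, start)
    return current_winner

# Made-up data: a 0, thirteen 2s, then a 3.
print(best_window_bruteforce([0] + [2] * 13 + [3]))  # prints (2, 0)
```

For 30,000 points this touches roughly 14 values per window, about 420,000 comparisons in total, which is negligible on current hardware.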

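6) Sliding window: I haven't thought it through, but I think maintaining a sliding window is more work than the simple linear pass described above.

Reason: OK, compute the value for the first 14 points; get a min and a max, and get the spread between them. But wait, we need a min and a max for the next window too. Now slide the window up one position. The value at the left end is gone; but was it the min, the max, or something in between? Suppose it was the min, and now it is gone. What is the second-lowest value? We don't have that information.

To keep a sliding window, you would need to sort each 14-data-point subsequence and remember the index positions of all the values. Then, when you slide, you can tell whether the value that exited on the left was the old min or the old max, and whether the new value entering on the right is the new min or the new max. But it's not worth the effort.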
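
The sorted-window idea described above can be sketched with the standard `bisect` module: keep the current 14 values in a sorted list, remove the value that leaves, and insert the one that enters. The name `best_window_sorted` is illustrative, and this is only one possible reading of that description.

```python
import bisect

def best_window_sorted(values, size=14):
    """Return (spread, start_index), keeping the current window sorted."""
    if len(values) < size:
        return None
    window = sorted(values[:size])
    best = (window[-1] - window[0], 0)
    for end in range(size, len(values)):
        # Remove the value leaving on the left (any equal copy will do).
        leaving = values[end - size]
        del window[bisect.bisect_left(window, leaving)]
        # Insert the value entering on the right, keeping the list sorted.
        bisect.insort(window, values[end])
        spread = window[-1] - window[0]
        if spread >= best[0]:  # ties go to the later window
            best = (spread, end - size + 1)
    return best

# Made-up data: a 0, thirteen 2s, then a 3.
print(best_window_sorted([0] + [2] * 13 + [3]))  # prints (2, 0)
```

With the list kept sorted, the window minimum and maximum are always its first and last elements, so the second-lowest value is available as soon as the old minimum leaves.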

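(This situation is a bit like the Boyer-Moore fast substring search algorithm. I don't remember the details, but it involves preprocessing the entire input and keeping a table of the positions where each value occurs. But this is way off-topic.)

Hope this helps...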
