python - Algorithm for summing/stacking values from time series graphs where data points don't match in time

Question

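I have a graphing/analysis problem that I can't quite get my head around. I can do it by brute force, but that is too slow. Maybe someone has a better idea, or knows of a fast Python library?

I have 2 or more time series data sets (x, y) that I want to aggregate (and subsequently plot). The problem is that the x values don't match up across the series, and I really don't want to resort to duplicating values into time bins.

So, given these two series: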

S1: (1;100) (5;100) (10;100)
S2: (4;150) (5;100) (18;150)

When added together, they should result in:

ST: (1;100) (4;250) (5;200) (10;200) (18;250)

Logic:

x=1 s1=100, s2=None, sum=100
x=4 s1=100, s2=150, sum=250 (note s1 value from previous value)
x=5 s1=100, s2=100, sum=200
x=10 s1=100, s2=100, sum=200
x=18 s1=100, s2=150, sum=250

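My current thinking is to iterate over a sorted list of the keys (x), keep the previous value for each series, and check each set for whether it has a new y at that x.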
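For reference, the idea just described might look roughly like the sketch below. The function name `sum_by_keys` is made up for illustration, and the sketch assumes each series has at most one y per x and that a series contributes 0 before its first point.

```python
def sum_by_keys(*series):
    """Sum step-wise series of (x, y) pairs over the union of their x values."""
    # One x -> y lookup per series (assumes x is unique within a series).
    lookups = [dict(s) for s in series]
    # Sorted union of every x that appears in any series.
    xs = sorted(set().union(*lookups))
    # Previous value of each series; 0 until the series has produced a point.
    last = [0] * len(series)
    result = []
    for x in xs:
        for i, lookup in enumerate(lookups):
            if x in lookup:
                last[i] = lookup[x]
        result.append((x, sum(last)))
    return result


S1 = ((1, 100), (5, 100), (10, 100))
S2 = ((4, 150), (5, 100), (18, 150))
print(sum_by_keys(S1, S2))
```

This does one dictionary lookup per series for every distinct x, so the cost grows with (number of distinct x values) times (number of series).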

Any ideas would be appreciated!

score 1 · Accepted Answer

Something like this:

def join_series(s1, s2):
    S1 = iter(s1)
    S2 = iter(s2)
    value1 = 0
    value2 = 0
    time1, next1 = next(S1)
    time2, next2 = next(S2)
    end1 = False
    end2 = False

    while True:    
        time = min(time1, time2)
        if time == time1:
            value1 = next1
            try:
                time1, next1 = next(S1)
            except StopIteration:
                end1 = True
                time1 = time2

        if time == time2:
            value2 = next2
            try:
                time2, next2 = next(S2)
            except StopIteration:
                end2 = True
                time2 = time1

        yield time, value1 + value2

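        if end1 and end2:
            # Use return, not "raise StopIteration": since Python 3.7 (PEP 479)
            # raising StopIteration inside a generator becomes a RuntimeError.
            return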

S1 = ((1, 100), (5, 100), (10, 100))
S2 = ((4, 150), (5, 100), (18, 150))

for result in join_series(S1, S2):
    print(result)

It basically keeps the current value of S1 and S2, together with the next point of each, and steps through them based on which has the lowest "upcoming time". It should handle lists of different lengths too, and it uses iterators all the way, so it should be able to handle massive data series.
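The same stepping idea can probably be extended to any number of series with the standard library's `heapq.merge`, which lazily merges already-sorted iterables. The sketch below is not part of the answer above; `sum_series` and `_tag` are hypothetical names, and it assumes every input series is already sorted by x.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def _tag(series, idx):
    """Yield (x, idx, y) so merged points remember which series they came from."""
    for x, y in series:
        yield x, idx, y


def sum_series(*series):
    """Lazily yield (x, total) per distinct x, carrying each series' last y forward."""
    last = [0] * len(series)  # a series counts as 0 before its first point
    # heapq.merge keeps the stream ordered by x without loading everything at once.
    merged = heapq.merge(*(_tag(s, i) for i, s in enumerate(series)))
    # Group points that share the same x so each x is emitted exactly once.
    for x, group in groupby(merged, key=itemgetter(0)):
        for _, idx, y in group:
            last[idx] = y
        yield x, sum(last)


S1 = ((1, 100), (5, 100), (10, 100))
S2 = ((4, 150), (5, 100), (18, 150))
for point in sum_series(S1, S2):
    print(point)
```

Like the generator above, this only holds one pending point per series in memory, so it should also cope with long inputs.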

score 1 · Accepted Answer

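One possible approach:

1. Format the elements of all series as tuples (x, y, series id), e.g. (4, 150, 1), add them to a single list of tuples, and sort it by x in ascending order.
2. Declare a list whose length equals the number of series, to maintain the "last seen" value of each series.
3. Iterate over each tuple in the list from step (1), and:

3.1 Update the "last seen" list according to the series id in the tuple.

3.2 When the x of the previously iterated tuple does not match the x of the current tuple, sum all elements of the "last seen" list and add the result to the final list.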

Now for my quick and dirty test:

>>> S1 = ((1, 100), (5, 100), (10, 100))
>>> S2 = ((4, 150), (5, 100), (18, 150))
>>> all = []
>>> for s in S1: all.append((s[0], s[1], 0))
...
>>> for s in S2: all.append((s[0], s[1], 1))
...
>>> all
[(1, 100, 0), (5, 100, 0), (10, 100, 0), (4, 150, 1), (5, 100, 1), (18, 150, 1)]
>>> all.sort()
>>> all
[(1, 100, 0), (4, 150, 1), (5, 100, 0), (5, 100, 1), (10, 100, 0), (18, 150, 1)]
>>> last_val = [0]*2
>>> last_x = all[0][0]
>>> final = []
>>> for e in all:
...     if e[0] != last_x:
...             final.append((last_x, sum(last_val)))
...     last_val[e[2]] = e[1]
...     last_x = e[0]
...
>>> final.append((last_x, sum(last_val)))
>>> final
[(1, 100), (4, 250), (5, 200), (10, 200), (18, 250)]
>>>
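The steps and the session above could be wrapped into a reusable function for any number of series. This is only a sketch: `stack_series` is a made-up name, and it assumes a series contributes 0 until its first point appears.

```python
def stack_series(*series):
    """Return [(x, total), ...] following the sort-and-carry-forward steps."""
    # Step 1: flatten to (x, y, series_id) tuples and sort by x.
    points = sorted(
        (x, y, sid) for sid, s in enumerate(series) for x, y in s
    )
    # Step 2: one "last seen" slot per series.
    last_seen = [0] * len(series)
    result = []
    # Step 3: apply each point, emitting a total once per distinct x.
    for i, (x, y, sid) in enumerate(points):
        last_seen[sid] = y
        # Emit only after every point sharing this x has been applied.
        if i + 1 == len(points) or points[i + 1][0] != x:
            result.append((x, sum(last_seen)))
    return result


S1 = ((1, 100), (5, 100), (10, 100))
S2 = ((4, 150), (5, 100), (18, 150))
print(stack_series(S1, S2))
```

Sorting makes this O(n log n) in the total number of points, and unlike the iterator-based answers it needs all points in memory at once.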

score 1 · Accepted Answer

Here is another approach that puts more of the behavior on the individual data streams:

class DataStream:
    """One (x, y) series: tracks its current y and the x of its next point."""

    def __init__(self, iterable):
        self.iterable = iter(iterable)
        self.next_item = (None, 0)
        self.next_x = None
        self.current_y = 0
        next(self)  # prime the stream so next_item holds the first point

    def __next__(self):
        if self.next_item is None:
            raise StopIteration()
        # The pending point becomes the current value...
        self.current_y = self.next_item[1]
        try:
            # ...and the following point, if any, becomes pending.
            self.next_item = next(self.iterable)
            self.next_x = self.next_item[0]
        except StopIteration:
            self.next_item = None
            self.next_x = None
        return self.next_item

    def __iter__(self):
        return self


class MergedDataStream:
    """Sums several DataStreams, yielding one (x, total) per distinct x."""

    def __init__(self, *iterables):
        self.streams = [DataStream(i) for i in iterables]
        self.outseq = []

    def __next__(self):
        xs = [stream.next_x for stream in self.streams if stream.next_x is not None]
        if not xs:
            raise StopIteration()
        next_x = min(xs)
        current_y = 0
        for stream in self.streams:
            # Advance only the streams whose pending point sits at this x.
            if stream.next_x == next_x:
                next(stream)
            current_y += stream.current_y
        self.outseq.append((next_x, current_y))
        return self.outseq[-1]

    def __iter__(self):
        return self


if __name__ == '__main__':
    seqs = [
        [(1, 100), (5, 100), (10, 100)],
        [(4, 150), (5, 100), (18, 150)],
    ]

    sm = MergedDataStream(*seqs)
    for x, y in sm:
        print("%2s: %s" % (x, y))

    print(sm.outseq)
