python - Overwrite one TimeSeries with another

Question

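I want to override the values of one time series with another. The input series has values at all points. The override time series will have the same index (i.e. dates), but I only want to override the values at some dates. The way I have thought of specifying this is to have a time series that holds a value wherever I want to override, and NaN wherever I don't want an override applied.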

Perhaps this is best illustrated with a quick example:

            ints  orts  outts
index
2013-04-01     1   NaN      1
2013-05-01     2    11      2
2013-06-01     3   NaN      3
2013-07-01     4     9      4
2013-08-01     2    97      5

# should become

            ints  orts  outts
index
2013-04-01     1   NaN      1
2013-05-01     2    11     11
2013-06-01     3   NaN      3
2013-07-01     4     9      9
2013-08-01     2    97     97

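As you can see from the example, I don't think the replace or where methods would work, since the values to be replaced depend on index position and not on the input value. Because I want to do this more than once, I've put it in a function, and I do have a solution that works, shown below: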

def overridets(ts, orts):
    tmp = pd.concat([ts, orts], join='outer', axis=1)
    out = tmp.apply(lambda x: x[0] if pd.isnull(x[1]) else x[1], axis=1)
    return out

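The problem is that this runs relatively slowly: 20 - 30 ms for a 500-point series in my environment. Multiplying two 500-point series takes about 200 us, so this is roughly 100 times slower. Any suggestions on how to speed it up?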

Edit

With help from @Andy and @bmu, my final solution to the problem is as follows:

def overridets(ts, orts):
    ts.name = 'outts'
    orts.name = 'orts'
    tmp = pd.concat([ts, orts], join='outer', axis=1)

    out = tmp['outts'].where(pd.isnull(tmp['orts']), tmp['orts'])
    return out

I didn't need inplace=True, since this is always wrapped in a function that returns a single time series. Nearly 50 times faster, thanks guys!
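For anyone who wants to run the idea end to end, here is a minimal self-contained sketch. It is a variant rather than the exact function above: it assumes the override series should be aligned to the input's index with reindex (instead of an outer concat), and it leaves the names of the inputs untouched. The sample values are the ones from the example table.

```python
import numpy as np
import pandas as pd


def overridets(ts, orts):
    """Return ts with values replaced wherever orts is not NaN."""
    # Align the override series to the input's index; dates missing
    # from orts become NaN and therefore leave ts untouched.
    aligned = orts.reindex(ts.index)
    # where() keeps ts where the condition is True, else takes `aligned`.
    return ts.where(aligned.isna(), aligned)


idx = pd.to_datetime(["2013-04-01", "2013-05-01", "2013-06-01",
                      "2013-07-01", "2013-08-01"])
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx, name="outts")
orts = pd.Series([np.nan, 11.0, np.nan, 9.0, 97.0], index=idx, name="orts")

result = overridets(ts, orts)
print(result)
```

Because where returns a new Series, the inputs are not modified, which is why inplace=True is unnecessary here.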

score 3 · Accepted Answer

A faster way to copy the non-NaN values of one column into another is to use loc with a boolean mask:

In [11]: df1
Out[11]:
            ints  orts  outts
index
2013-04-01     1   NaN      1
2013-05-01     2    11      2
2013-06-01     3   NaN      3
2013-07-01     4     9      4
2013-08-01     2    97      5

In [12]: df1.loc[pd.notnull(df1['orts']), 'outts'] = df1['orts']

In [13]: df1
Out[13]:
            ints  orts  outts
index
2013-04-01     1   NaN      1
2013-05-01     2    11     11
2013-06-01     3   NaN      3
2013-07-01     4     9      9
2013-08-01     2    97     97

This is much faster than your function:

In [21]: df500 = pd.DataFrame(np.random.randn(500, 2), columns=['orts', 'outts'])

In [22]: %timeit overridets(df500['outts'], df500['orts'])
100 loops, best of 3: 14 ms per loop

In [23]: %timeit df500.loc[pd.notnull(df500['orts']), 'outts'] = df500['orts']
1000 loops, best of 3: 400 us per loop

In [24]: df100k = pd.DataFrame(np.random.randn(100000, 2), columns=['orts', 'outts'])

In [25]: %timeit overridets(df100k['outts'], df100k['orts'])
1 loops, best of 3: 2.67 s per loop

In [26]: %timeit df100k.loc[pd.notnull(df100k['orts']), 'outts'] = df100k['orts']
100 loops, best of 3: 9.61 ms per loop
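The %timeit lines above only work inside IPython. As a rough sketch of how the same comparison could be run as a plain script, the following uses the standard library timeit module. The helper names with_apply and with_loc are made up for illustration, and, unlike the benchmark above, roughly half of the orts column is set to NaN (an assumption) so that the mask actually selects a subset of rows. Timings will vary by machine and pandas version, so no expected numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = pd.DataFrame(rng.standard_normal((500, 2)), columns=["orts", "outts"])
# Assumption: blank out about half of the override column so the mask is non-trivial.
base.loc[rng.random(500) < 0.5, "orts"] = np.nan


def with_apply(df):
    # Row-wise Python lambda, in the spirit of the question's original function.
    return df.apply(
        lambda row: row["outts"] if pd.isnull(row["orts"]) else row["orts"],
        axis=1,
    )


def with_loc(df):
    out = df.copy()  # copy so repeated timing runs start from the same data
    out.loc[out["orts"].notna(), "outts"] = out["orts"]
    return out["outts"]


# Both approaches should agree before their timings are compared.
assert np.array_equal(with_apply(base).to_numpy(), with_loc(base).to_numpy())

for fn in (with_apply, with_loc):
    seconds = timeit.timeit(lambda: fn(base), number=20) / 20
    print(f"{fn.__name__}: {seconds * 1e6:.0f} us per call")
```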

As @bmu points out, you're actually better off using where:

In [31]: %timeit df500['outts'].where(pd.isnull(df500['orts']), df500['orts'], inplace=True)
1000 loops, best of 3: 281 us per loop

In [32]: %timeit df100k['outts'].where(pd.isnull(df100k['orts']), df100k['orts'], inplace=True)
100 loops, best of 3: 2.9 ms per loop
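One caveat worth hedging: calling where(..., inplace=True) on a column selected as df['outts'] operates on an intermediate object, and depending on the pandas version and its copy-on-write settings the change may not reach the parent DataFrame. A sketch that sidesteps the question by assigning the result back explicitly is shown below, using the sample values from the question. It also shows mask, which is the same operation with the condition inverted.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"orts": [np.nan, 11.0, np.nan, 9.0, 97.0],
     "outts": [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=pd.to_datetime(["2013-04-01", "2013-05-01", "2013-06-01",
                          "2013-07-01", "2013-08-01"]),
)

# where(): keep outts where orts is NaN, otherwise take orts.
via_where = df["outts"].where(df["orts"].isna(), df["orts"])
# mask(): the inverse condition; replace outts where orts is not NaN.
via_mask = df["outts"].mask(df["orts"].notna(), df["orts"])

assert via_where.equals(via_mask)
df["outts"] = via_where  # explicit assignment instead of inplace=True
print(df)
```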

score 0 · Answer

The combine_first function is built into pandas and handles this:

In [62]:  df

Out [62]:
                ints  orts  outts
    2013-04-01     1   NaN      1
    2013-05-01     2    11     11
    2013-06-01     3   NaN      3
    2013-07-01     4     9      9
    2013-08-01     2    97     97

In [63]:
    df['outts'] =  df.orts.combine_first(df.ints)
    df

Out [63]:
                ints  orts  outts
    2013-04-01     1   NaN      1
    2013-05-01     2    11     11
    2013-06-01     3   NaN      3
    2013-07-01     4     9      9
    2013-08-01     2    97     97

This should be as fast as any of the previous solutions...

In [64]:
    df500 = pd.DataFrame(np.random.randn(500, 2), columns=['orts', 'outts'])
    %timeit df500.orts.combine_first(df500.outts)

Out [64]:
    1000 loops, best of 3: 210 µs per loop
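As a small standalone sketch of the same call on two Series, using the ints and orts values from the question's table: the series that combine_first is called on (orts here) takes priority wherever it is non-null, and the argument fills the gaps.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(["2013-04-01", "2013-05-01", "2013-06-01",
                      "2013-07-01", "2013-08-01"])
ints = pd.Series([1.0, 2.0, 3.0, 4.0, 2.0], index=idx, name="ints")
orts = pd.Series([np.nan, 11.0, np.nan, 9.0, 97.0], index=idx, name="orts")

# Values of orts win wherever they are non-null; gaps fall back to ints.
outts = orts.combine_first(ints)
print(outts)
```

Note that combine_first returns a result indexed by the union of both indexes, so if the override series carries dates the input lacks, those dates will appear in the output.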
