37

I have time-indexed data:

df2 = pd.DataFrame({ 'day': pd.Series([date(2012, 1, 1), date(2012, 1, 3)]), 'b' : pd.Series([0.22, 0.3]) })
df2 = df2.set_index('day')
df2
               b
 day             
2012-01-01  0.22
2012-01-03  0.30

What is the best way to extend this data frame so that it has one row for every day in January 2012 (say), where all columns are set to NaN (here only b) where we don't have data?

So the desired result would be:

               b
 day             
2012-01-01  0.22
2012-01-02   NaN
2012-01-03  0.30
2012-01-04   NaN
...
2012-01-31   NaN

Many thanks!

4

6 回答 6

37

使用这个(从 pandas 1.1.3 开始):

ix = pd.date_range(start=date(2012, 1, 1), end=date(2012, 1, 31), freq='D')
df2.reindex(ix)

这使:

               b
2012-01-01  0.22
2012-01-02   NaN
2012-01-03  0.30
2012-01-04   NaN
2012-01-05   NaN
[...]
2012-01-29   NaN
2012-01-30   NaN
2012-01-31   NaN

对于旧版本的 pandas 替换pd.date_rangepd.DatetimeIndex.

于 2014-05-22T13:07:41.933 回答
7

您可以将通过日重新采样为频率,而无需指定fill_method参数,缺失值将根据需要NaN填充

df3 = df2.asfreq('D')
df3

Out[16]:
               b
2012-01-01  0.22
2012-01-02   NaN
2012-01-03  0.30

要回答您的第二部分,我目前想不出更优雅的方式:

df3 = DataFrame({ 'day': Series([date(2012, 1, 4), date(2012, 1, 31)])})
df3.set_index('day',inplace=True)
merged = df2.append(df3)
merged = merged.asfreq('D')
merged


Out[46]:
               b
2012-01-01  0.22
2012-01-02   NaN
2012-01-03  0.30
2012-01-04   NaN
2012-01-05   NaN
2012-01-06   NaN
2012-01-07   NaN
2012-01-08   NaN
2012-01-09   NaN
2012-01-10   NaN
2012-01-11   NaN
2012-01-12   NaN
2012-01-13   NaN
2012-01-14   NaN
2012-01-15   NaN
2012-01-16   NaN
2012-01-17   NaN
2012-01-18   NaN
2012-01-19   NaN
2012-01-20   NaN
2012-01-21   NaN
2012-01-22   NaN
2012-01-23   NaN
2012-01-24   NaN
2012-01-25   NaN
2012-01-26   NaN
2012-01-27   NaN
2012-01-28   NaN
2012-01-29   NaN
2012-01-30   NaN
2012-01-31   NaN

这构造了第二个时间序列,然后我们asfreq('D')像以前一样追加和调用。

于 2013-10-01T14:45:35.727 回答
3

这是另一种选择:首先NaN在您想要的最后一天添加一条记录,然后重新采样。这样,重采样将为您填补缺失的日期。

起始帧:

import pandas as pd
import numpy as np
from datetime import date

df2 = pd.DataFrame({ 'day': pd.Series([date(2012, 1, 1), date(2012, 1, 3)]), 'b' : pd.Series([0.22, 0.3]) })
df2= df2.set_index('day')
df2

Out:
                  b
    day 
    2012-01-01  0.22
    2012-01-03  0.30

填充框架:

df2 = df2.set_value(date(2012,1,31),'b',np.float('nan'))
df2.asfreq('D')

Out:
                b
    day 
    2012-01-01  0.22
    2012-01-02  NaN
    2012-01-03  0.30
    2012-01-04  NaN
    2012-01-05  NaN
    2012-01-06  NaN
    2012-01-07  NaN
    2012-01-08  NaN
    2012-01-09  NaN
    2012-01-10  NaN
    2012-01-11  NaN
    2012-01-12  NaN
    2012-01-13  NaN
    2012-01-14  NaN
    2012-01-15  NaN
    2012-01-16  NaN
    2012-01-17  NaN
    2012-01-18  NaN
    2012-01-19  NaN
    2012-01-20  NaN
    2012-01-21  NaN
    2012-01-22  NaN
    2012-01-23  NaN
    2012-01-24  NaN
    2012-01-25  NaN
    2012-01-26  NaN
    2012-01-27  NaN
    2012-01-28  NaN
    2012-01-29  NaN
    2012-01-30  NaN
    2012-01-31  NaN
于 2016-05-19T16:36:28.750 回答
2

马克的回答似乎不再适用于 pandas 1.1.1。

但是,使用相同的想法,以下工作:

from datetime import datetime
import pandas as pd


# get start and desired end dates
first_date = df['date'].min()
today = datetime.today()

# set index
df.set_index('date', inplace=True)

# and here is were the magic happens
idx = pd.date_range(first_date, today, freq='D')
df = df.reindex(idx)

编辑:刚刚发现这个确切的用例在文档中:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html#pandas.DataFrame.reindex

于 2020-09-14T18:10:56.377 回答
0

不完全是问题,因为在这里您知道第二个索引是一月份的所有日子,但是假设您有另一个索引来自另一个数据框 df1,它可能是不相交的并且具有随机频率。然后你可以这样做:

ix = pd.DatetimeIndex(list(df2.index) + list(df1.index)).unique().sort_values()
df2.reindex(ix)

将索引转换为列表允许人们以一种自然的方式创建更长的列表。

于 2020-05-01T18:56:25.287 回答
0
def extendframe(df, ndays):
    """
    (df, ndays) -> df that is padded by ndays in beginning and end
    """
    ixd = df.index - datetime.timedelta(ndays)
    ixu = df.index + datetime.timedelta(ndays)
    ixx = df.index.union(ixd.union(ixu))
    df_ = df.reindex(ixx)
    return df_
于 2020-05-12T16:33:03.400 回答