python - Pandas DataFrame - Remove values from groups with fewer than X rows

Question

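I need to compute standardized means from a time series (monthly frequency), but I also need to exclude "incomplete" years (fewer than 12 months) from the calculation.

NumPy/SciPy "working" version: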

import numpy as np
import scipy.stats as sts

url='http://www.cpc.ncep.noaa.gov/data/indices/sstoi.indices'
npdata = np.genfromtxt(url, skip_header=1)
# sorted() because iterating over a set does not guarantee chronological order
unique_enso_year = sorted(int(value) for value in set(npdata[:, 0]))
nin34 = np.zeros(len(unique_enso_year))
for ind, year in enumerate(unique_enso_year):
    indexes = np.flatnonzero(npdata[:, 0]==year)
    if len(indexes) == 12:
        nin34[ind] = np.mean(npdata[indexes, 9])
    else:
        nin34[ind] = np.nan

# np.nanmean/np.nanstd ignore NaN; ddof=1 matches the default of the old sts.nanstd
nin34x = (nin34 - np.nanmean(nin34)) / np.nanstd(nin34, ddof=1)

array([[  1.02250000e+00,   5.15000000e-01,  -6.73333333e-01,
     -7.02500000e-01,   1.16666667e-01,   1.32916667e+00,
     -1.10333333e+00,  -8.11666667e-01,   1.51666667e-01,
      6.42500000e-01,   6.49166667e-01,   3.71666667e-01,
      4.05000000e-01,  -1.98333333e-01,  -4.79166667e-01,
      1.24666667e+00,  -1.44166667e-01,  -1.18166667e+00,
     -8.89166667e-01,  -2.51666667e-01,   7.36666667e-01,
      3.02500000e-01,   3.83333333e-01,   1.19166667e-01,
      1.70833333e-01,  -5.25000000e-01,  -7.35000000e-01,
      3.75000000e-01,  -4.50833333e-01,  -8.30000000e-01,
     -1.41666667e-02,              nan]])
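
For reference, the per-year loop above can probably be written without an explicit loop. The following is only a sketch on synthetic data: `years` and `values` are hypothetical stand-ins for columns 0 and 9 of `npdata`, not the NOAA file itself.

```python
import numpy as np

# Hypothetical stand-in data: two complete years and one with only 5 months.
years = np.repeat([2011, 2012, 2013], [12, 12, 5])
values = np.arange(years.size, dtype=float)

# Map every row to its year group and count the rows per year.
uniq, inverse, counts = np.unique(years, return_inverse=True, return_counts=True)

# Per-year sums via bincount, then means; incomplete years become NaN.
sums = np.bincount(inverse, weights=values)
annual = np.where(counts == 12, sums / counts, np.nan)

# Standardize while ignoring the NaN entries (ddof=1 is an assumption here,
# chosen to mirror a sample standard deviation).
anom = (annual - np.nanmean(annual)) / np.nanstd(annual, ddof=1)
print(dict(zip(uniq.tolist(), anom.tolist())))
```

The incomplete year stays in the result as NaN, which is the behavior the loop version has.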

Pandas attempt:

import pandas as pd
from datetime import datetime

def parse(yr, mon):
    date = datetime(year=int(yr), day=2, month=int(mon))
    return date


url='http://www.cpc.ncep.noaa.gov/data/indices/sstoi.indices'
data = pd.read_table(url, sep=' ', header=0, skiprows=0, parse_dates = [['YR', 'MON']], skipinitialspace=True, index_col=0, date_parser=parse)                     
grouped = data.groupby(lambda x: x.year)

zscore = lambda x: (x - x.mean()) / x.std()
transformed = grouped.transform(zscore)
print(transformed['ANOM.3'])

YR_MON
1982-01-02   -0.986922
1982-02-02   -1.179216
1982-03-02   -1.179216
1982-04-02   -0.885119
1982-05-02   -0.376105
1982-06-02    0.087664
1982-07-02   -0.161188
1982-08-02    0.098975
1982-09-02    0.415695
1982-10-02    1.049134
1982-11-02    1.286674
1982-12-02    1.829622
1983-01-02    1.715072
1983-02-02    1.428598
1983-03-02    0.976272
...
2012-03-02   -0.999284
2012-04-02   -0.663736
2012-05-02   -0.063283
2012-06-02    0.572491
2012-07-02    0.961020
2012-08-02    1.314227
2012-09-02    0.925699
2012-10-02    0.537170
2012-11-02    0.660793
2012-12-02   -0.169245
2013-01-02   -1.001483
2013-02-02   -0.924445
2013-03-02    0.462223
2013-04-02    1.386668
2013-05-02    0.077037
Name: ANOM.3, Length: 377, dtype: float64

This is not what I want, because it also includes 2013 (which has only 5 months).

To extract what I want, I have to do something like:

annual = grouped.mean()['ANOM.3'][:-1]  # drop the last (incomplete) year by position
(annual - annual.mean()) / annual.std()

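But this assumes I already know that the last year is incomplete, and I then lose the np.nan that I should have as the value for 2013.

So I am now trying to do the filtering in pandas, for example: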

grouped2 = data.groupby(lambda x: x.year).apply(lambda sdf: sdf if len(sdf) > 11 else None).reset_index(drop=True)

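This gives me the "correct values", but it produces a new DataFrame without the timestamp index. I am sure there is a simple and elegant way to do this. Thanks for your help!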
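
One possible way to drop the short groups while keeping the timestamp index is `GroupBy.filter`. This is a sketch on synthetic data, not the NOAA table: the frame `data` below and its values are made up for illustration, and only the column name `ANOM.3` is borrowed from the question.

```python
import numpy as np
import pandas as pd

# Synthetic monthly frame: 2011 and 2012 are complete, 2013 has only 5 months.
idx = pd.date_range("2011-01-01", periods=29, freq="MS")
data = pd.DataFrame({"ANOM.3": np.arange(29, dtype=float)}, index=idx)

# Keep only the rows that belong to years with at least 12 rows.
# filter() returns the surviving rows with their original index.
full = data.groupby(data.index.year).filter(lambda g: len(g) >= 12)

print(type(full.index).__name__, len(full))
print(full.tail(3))
```

Because the result is still indexed by timestamp, it can be grouped by year again without rebuilding the index.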

score 0 · Accepted Answer

I found this way:

import pandas as pd

url='http://www.cpc.ncep.noaa.gov/data/indices/sstoi.indices'

ts_raw = pd.read_table(url, 
                        sep=' ', 
                        header=0, 
                        skiprows=0, 
                        parse_dates = [['YR', 'MON']], 
                        skipinitialspace=True, 
                        index_col=0, 
                        date_parser=parse)                     
ts_year_group = ts_raw.groupby(lambda x: x.year).apply(lambda sdf: sdf if len(sdf) > 11 else None) 
ts_range = pd.date_range(ts_year_group.index[0][1], 
                         ts_year_group.index[-1][1]+pd.DateOffset(months=1), 
                         freq="M")
ts = pd.DataFrame(ts_year_group.values, 
                  index=ts_range, 
                  columns=ts_year_group.keys())
ts_fullyears_group = ts.groupby(lambda x: x.year)
annual = ts_fullyears_group.mean()['ANOM.3']  # annual means over complete years only
# Series.mean() and Series.std() skip NaN; std() uses ddof=1
nin_anomalies = (annual - annual.mean()) / annual.std()

nin_anomalies

1982    1.527215
1983    0.779877
1984   -0.970047
1985   -1.012997
1986    0.193297
1987    1.978809
1988   -1.603259
1989   -1.173755
1990    0.244837
1991    0.967632
1992    0.977449
1993    0.568807
1994    0.617893
1995   -0.270568
1996   -0.684120
1997    1.857320
1998   -0.190803
1999   -1.718612
2000   -1.287880
2001   -0.349106
2002    1.106301
2003    0.466953
2004    0.585987
2005    0.196978
2006    0.273062
2007   -0.751613
2008   -1.060856
2009    0.573715
2010   -0.642396
2011   -1.200752
2012    0.000633
Name: ANOM.3, dtype: float64

I'm sure there is a better way to do the same thing :/
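
A shorter route to the same kind of result might be to mask the annual means by the number of observations per year. This is a sketch on a synthetic series (random values, hypothetical name `s`), assuming a year counts as complete when it has exactly 12 non-missing months.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: 2010 to 2012 complete, 2013 with only 5 months.
idx = pd.date_range("2010-01-01", periods=41, freq="MS")
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(size=41), index=idx, name="ANOM.3")

by_year = s.groupby(s.index.year)

# Annual mean, replaced by NaN wherever the year has fewer than 12 values.
annual = by_year.mean().where(by_year.count() == 12)

# Series.mean() and Series.std() skip NaN by default; std() uses ddof=1.
nin_anomalies = (annual - annual.mean()) / annual.std()
print(nin_anomalies)
```

Unlike dropping the last year by position, this keeps 2013 in the output as NaN and does not require knowing in advance which years are incomplete.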

score 0 · Accepted Answer

Here is a solution. It gets a little hacky in places because your dates fall on the 2nd of each month.

Start the same way:

In [205]: import pandas as pd

In [206]: from datetime import datetime

In [207]: from datetime import timedelta

In [208]: def parse(yr, mon):
   .....:         date = datetime(year=int(yr), day=2, month=int(mon))
   .....:         return date
   .....: 

In [209]: url='http://www.cpc.ncep.noaa.gov/data/indices/sstoi.indices'

In [210]: data = pd.read_table(url, sep=' ', header=0, skiprows=0, parse_dates = [['YR', 'MON']], skipinitialspace=True, index_col=0, date_parser=parse)                     

In [211]: grouped = data.groupby(lambda x: x.year)

Get the complete years:

In [212]: full_year = grouped['NINO1+2'].count() == 12

In [213]: full_year
Out[213]: 
1982     True
1983     True
1984     True
1985     True
1986     True
1987     True
1988     True
1989     True
1990     True
1991     True
1992     True
1993     True
1994     True
1995     True
1996     True
1997     True
1998     True
1999     True
2000     True
2001     True
2002     True
2003     True
2004     True
2005     True
2006     True
2007     True
2008     True
2009     True
2010     True
2011     True
2012     True
2013    False
dtype: bool

Now we deal with getting an index of the right dtype and aligning it. This could probably be simplified a bit:

In [214]: strt = data.index[0] - timedelta(1)
In [215]: idx = pd.DatetimeIndex(start=strt, periods=len(full_year - 1), freq='BA-JAN')

In [216]: idx = idx + timedelta(1)  # Get to 2nd of each month

In [232]: idx
Out[232]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[1982-01-02 00:00:00, ..., 2013-01-02 00:00:00]
Length: 32, Freq: None, Timezone: None

In [233]: full_year.index = idx

This is the key step:

In [234]: full_year = full_year.reindex_like(data, method='ffill')

Hopefully this is right:

In [235]: data.ix[full_year].tail()
Out[235]: 
            NINO1+2  ANOM  NINO3  ANOM.1  NINO4  ANOM.2  NINO3.4  ANOM.3  \
YR_MON                                                                     
2012-08-02    20.99  0.35  25.72    0.73  29.10    0.42    27.55    0.73   
2012-09-02    20.83  0.49  25.28    0.43  29.12    0.43    27.24    0.51   
2012-10-02    20.68 -0.11  24.93    0.01  29.16    0.50    26.98    0.29   
2012-11-02    21.21 -0.38  25.11    0.14  29.17    0.54    27.01    0.36   
2012-12-02    22.13 -0.68  24.91   -0.23  28.71    0.23    26.46   -0.11   

            Unnamed: 10  
YR_MON                   
2012-08-02          NaN  
2012-09-02          NaN  
2012-10-02          NaN  
2012-11-02          NaN  
2012-12-02          NaN

From here, just work with data.ix[full_year] and you are done.
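
The index alignment step can possibly be avoided by building the boolean mask directly on the monthly index with `transform`. This is a sketch on synthetic data (the frame and its values are hypothetical); it uses `.loc` for the selection, since `.ix` was removed in later pandas releases.

```python
import numpy as np
import pandas as pd

# Synthetic monthly frame: 2011 and 2012 are complete, 2013 has only 5 months.
idx = pd.date_range("2011-01-01", periods=29, freq="MS")
data = pd.DataFrame({"ANOM.3": np.linspace(-1.0, 1.0, 29)}, index=idx)

# Count of non-missing values in each row's year, broadcast back to every row.
rows_per_year = data.groupby(data.index.year)["ANOM.3"].transform("count")
full_year = rows_per_year == 12

# A boolean Series on the same index selects the matching rows with .loc.
print(data.loc[full_year].tail())
```

Since `full_year` already shares the index of `data`, no reindexing or forward fill is needed before selecting.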
