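I have been computing weighted averages with pandas groupby and numpy's np.average. The problem seems to be missing values (NaN) in the data, not in the weights. I've put together a conceptual example below.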

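The behavior I want is that when a data value is missing, the weight for that record is ignored as well. Simply dropping the row is not an option, because the other data columns in that row are populated. I thought np.ma.average was exactly what I needed, but it also gives me a NaN result.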

Any suggestions?

import numpy as np
import pandas as pd

df = pd.DataFrame({'groups': ['a', 'a', 'b', 'a', 'b', 'b'],
                   'data': [3, 3, 4, 2, 2.5, np.nan],
                   'Weights': [1, 2, 1, 3, 1, 3]})

def wavg(subdf):
    series = pd.Series()
    for column in df.columns:
        series['np.mean'] = np.mean(subdf['data'])
        series['np.average (no weights)'] = np.average(subdf['data'])
        series['np.average (weighted)'] = np.average(subdf['data'], weights=subdf['Weights'])
        series['np.ma.average (weighted)'] = np.ma.average(subdf['data'], weights=subdf['Weights']) 
    return series

df.groupby('groups').apply(wavg)

This gives me:

         np.mean  np.average (no weights)  np.average (weighted)  np.ma.average (weighted)
groups
a       2.666667                 2.666667                    2.5                       2.5
b       3.250000                      NaN                    NaN                       NaN

===================================== For anyone curious, this is what I ended up using:

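def wavg(subdf):
    # `columns` is assumed to be a list of the data columns to average, defined elsewhere.
    series = pd.Series(dtype=float)
    for column in columns:
        # Drop only the rows where this column is missing, so their weights are ignored too.
        valid = subdf.dropna(subset=[column])
        if len(valid) == 0:
            series[str(column)] = np.nan
        else:
            series[str(column)] = np.average(valid[column], weights=valid['Weights'])

    return series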
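
As a usage sketch, the per-column approach above could be run against the example data like this. The `columns` list is an assumption made for illustration (the example DataFrame has only one value column, 'data'), and the result is built from a dict rather than an empty Series purely for convenience.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'groups': ['a', 'a', 'b', 'a', 'b', 'b'],
                   'data': [3, 3, 4, 2, 2.5, np.nan],
                   'Weights': [1, 2, 1, 3, 1, 3]})

# Assumption: the value columns to average; only 'data' exists in the example.
columns = ['data']

def wavg(subdf):
    result = {}
    for column in columns:
        # Drop rows that are missing in this column only, so other
        # columns can still use those rows.
        valid = subdf.dropna(subset=[column])
        if len(valid) == 0:
            result[column] = np.nan
        else:
            result[column] = np.average(valid[column], weights=valid['Weights'])
    return pd.Series(result, dtype=float)

# Select the value and weight columns explicitly so the grouping column
# is not passed into wavg.
result = df.groupby('groups')[columns + ['Weights']].apply(wavg)
print(result)
```

For group b the NaN row is dropped together with its weight of 3, so the weighted average is taken over the two remaining rows only.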

1 Answer

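Since np.average cannot handle NaN on its own, you have to handle them yourself. The simplest way is to subset subdf before doing anything with it. Add subdf = subdf.dropna(subset=['data']) at the start of your wavg to drop the rows that contain NaN in the 'data' column: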

def wavg(subdf):
    series = pd.Series(dtype=float)
    # Drop rows with NaN in 'data' so that their weights are ignored as well.
    subdf = subdf.dropna(subset=['data'])

    series['np.mean'] = np.mean(subdf['data'])
    series['np.average (no weights)'] = np.average(subdf['data'])
    series['np.average (weighted)'] = np.average(subdf['data'], weights=subdf['Weights'])
    series['np.ma.average (weighted)'] = np.ma.average(subdf['data'], weights=subdf['Weights'])

    return series

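As I suggested in the comments, I removed the loop from wavg. You only want to return one set of averages per group (i.e., one mean, one average, one weighted average, one masked average). With your loop, you would be recomputing the same things once for every column in your DataFrame (three times for the example above).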
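
If you would rather stay with np.ma.average, as the question tried, one possible variant is sketched below. np.ma.average only ignores entries that are masked, and a plain NaN is not masked automatically, so the values have to be passed through np.ma.masked_invalid first. The function name `masked_wavg` and its parameters are hypothetical, chosen for this illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'groups': ['a', 'a', 'b', 'a', 'b', 'b'],
                   'data': [3, 3, 4, 2, 2.5, np.nan],
                   'Weights': [1, 2, 1, 3, 1, 3]})

def masked_wavg(subdf, column='data', weight_column='Weights'):
    # Hypothetical helper: mask NaN/inf entries so that np.ma.average
    # leaves out both the values and their weights.
    values = np.ma.masked_invalid(subdf[column].to_numpy(dtype=float))
    if values.count() == 0:
        # Every value in the group is missing.
        return np.nan
    weights = subdf[weight_column].to_numpy(dtype=float)
    return float(np.ma.average(values, weights=weights))

result = df.groupby('groups')[['data', 'Weights']].apply(masked_wavg)
print(result)
```

This leaves the rows of subdf untouched, which may be convenient when several columns with different missing patterns are averaged in the same function.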

Answered 2014-06-28T18:21:44.463