python - Python: element-wise mean of lists in a DataFrame column, conditioned on another column

Question

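I have a DataFrame that looks like this, with three columns (10 different stimuli, 16 trials, and a data column containing lists of equal length). I simply want the element-wise mean of the data column, grouped by stimulus. Since I have 10 different stimuli, this should yield 10 arrays, one per stimulus, each being the mean of all the data arrays across trials.

I thought of something like this, but it gives me something really weird.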

df.groupby('stimulus').apply(np.mean)
>> IndexError: tuple index out of range

Building my DataFrame

import numpy as np
import pandas as pd

trial_vec       = np.tile(np.arange(16)+1, 10)
stimulus_vec    = np.repeat([-2., -1.75, -1., -0.75, -0.5,  0.5,  1.,  1.25,  1.75,  2.5 ], 16)
data_vec        = np.random.randint(0, 16, size=160)
df              = pd.DataFrame({'trial': trial_vec, 'stimulus': stimulus_vec, 'data': data_vec}).astype('object')
df["data"]      = [np.random.rand(4).tolist() for i in range(160)]
df

score 5 · Accepted Answer

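You can convert each group to a 2D list. As long as every cell in the `data` column holds the same number of elements, that list can be converted to a 2D numpy array, and you can then take the `mean` over `axis=0` (a column-wise mean):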

df.groupby('stimulus').data.apply(lambda g: np.mean(g.values.tolist(), axis=0))

#stimulus
#-2.00    [0.641834320107, 0.427639804593, 0.42733812964...
#-1.75    [0.622484839138, 0.529860126072, 0.63310754064...
#-1.00    [0.546323060494, 0.465573022088, 0.54947320390...
#-0.75    [0.431675052484, 0.367636755052, 0.45263194597...
#-0.50    [0.423135952819, 0.544110613089, 0.55496058720...
# 0.50    [0.421858616927, 0.439204977418, 0.43153540636...
# 1.00    [0.612239664017, 0.499305567037, 0.46284515082...
# 1.25    [0.498544756769, 0.481073640317, 0.43564801829...
# 1.75    [0.51821909334, 0.44904063908, 0.358509374567,...
# 2.50    [0.465606275355, 0.516448419224, 0.33715002349...
#Name: data, dtype: object

Or `stack` the data into a 2D array and then take the `mean` over `axis=0`:

df.groupby('stimulus').data.apply(lambda g: np.mean(np.stack(g), axis=0))
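To check the approach by hand, here is a minimal sketch on a small hand-made frame. The name `toy` and its values are illustrative and not taken from the question; they are chosen so the per-stimulus means are easy to verify.

```python
import numpy as np
import pandas as pd

# Illustrative toy frame: two stimuli, two trials each, lists of length 3.
toy = pd.DataFrame({
    "stimulus": [-2.0, -2.0, 0.5, 0.5],
    "trial": [1, 2, 1, 2],
    "data": [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0],
             [0.0, 0.0, 0.0], [2.0, 4.0, 6.0]],
})

# Stack each group's lists into an (n_trials, n_elements) array,
# then average over the trial axis (axis=0).
means = toy.groupby("stimulus")["data"].apply(
    lambda g: np.mean(np.stack(g.tolist()), axis=0)
)

print(means.loc[-2.0])  # element-wise mean of [1, 2, 3] and [3, 4, 5]
print(means.loc[0.5])   # element-wise mean of [0, 0, 0] and [2, 4, 6]
```

The result is a Series indexed by stimulus in which each cell holds one averaged array.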

Edit: If there are `nan`s in the data column, you can use `np.nanmean` to compute the `mean` without the `nan`s:

df.groupby('stimulus').data.apply(lambda g: np.nanmean(np.stack(g), axis=0))
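The difference between the two functions can be seen in a small sketch. The frame `toy` below is an illustrative assumption, with one missing value placed in the second position of the first list.

```python
import numpy as np
import pandas as pd

# Illustrative toy frame: one stimulus, three trials, one NaN in the data.
toy = pd.DataFrame({
    "stimulus": [1.0, 1.0, 1.0],
    "data": [[1.0, np.nan], [3.0, 4.0], [5.0, 8.0]],
})

grouped = toy.groupby("stimulus")["data"]

# np.mean propagates NaN: any position containing a NaN averages to NaN.
plain = grouped.apply(lambda g: np.mean(np.stack(g.tolist()), axis=0))

# np.nanmean ignores NaN entries and averages the remaining values.
skipna = grouped.apply(lambda g: np.nanmean(np.stack(g.tolist()), axis=0))

print(plain.loc[1.0])
print(skipna.loc[1.0])
```

Note that if a position is `nan` in every trial of a group, `np.nanmean` still returns `nan` for that position and emits a `RuntimeWarning`.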

score 3 · Accepted Answer

Update

This is actually one of the rare use cases for a grouper that is not in the current DataFrame.

df['data'].apply(pd.Series).groupby(df['stimulus']).mean()
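As a sketch of what this one-liner does: the lists are expanded into one column per element, and the resulting wide frame is grouped by the external `stimulus` Series, which pandas aligns on the index. The frame `toy` is illustrative. The second variant, which builds the wide frame with `pd.DataFrame(...tolist())`, is an alternative not taken from this answer; it is commonly faster than `apply(pd.Series)` on large frames.

```python
import numpy as np
import pandas as pd

# Illustrative toy frame: two stimuli, two trials each, lists of length 3.
toy = pd.DataFrame({
    "stimulus": [-2.0, -2.0, 0.5, 0.5],
    "data": [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0],
             [0.0, 0.0, 0.0], [2.0, 4.0, 6.0]],
})

# Variant from the answer: expand each list into its own row of columns.
wide_a = toy["data"].apply(pd.Series)

# Alternative (an assumption, not from the answer): build the wide frame directly.
wide_b = pd.DataFrame(toy["data"].tolist(), index=toy.index)

# Group the wide frame by a Series that is not one of its columns.
means_a = wide_a.groupby(toy["stimulus"]).mean()
means_b = wide_b.groupby(toy["stimulus"]).mean()

print(means_a)
```

Both variants give a frame with one row per stimulus and one column per list position.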

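Original

I am not sure exactly what you are trying to do, but you generally should not have lists in your DataFrame. I would first format your data properly and then take the mean of each column by group.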

data_proper = df['data'].apply(pd.Series)
df_new = pd.concat([df.drop('data',axis=1), data_proper], axis=1)
df_new.head()

  stimulus trial         0         1         2         3
0       -2     1  0.046361  0.967723  0.707726  0.708462
1       -2     2  0.270566  0.778324  0.638878  0.276983
2       -2     3  0.261356  0.563411  0.639114  0.111150
3       -2     4  0.124745  0.532362  0.869781  0.142513
4       -2     5  0.707596  0.137417  0.493232  0.098975

df_new.groupby('stimulus').mean()

                 0         1         2         3
stimulus                                        
-2.00     0.516795  0.458579  0.527230  0.360560
-1.75     0.418950  0.497287  0.442577  0.518487
-1.00     0.569175  0.350724  0.429025  0.562950
-0.75     0.474533  0.517560  0.472101  0.658333
-0.50     0.481185  0.426829  0.414059  0.571252
 0.50     0.432719  0.563101  0.421617  0.531289
 1.00     0.478947  0.412383  0.458543  0.590503
 1.25     0.596648  0.520953  0.515184  0.513206
 1.75     0.492729  0.524673  0.567336  0.465172
 2.50     0.369798  0.540603  0.499210  0.605297

Or as one chained expression, inspired by @Scott Boston:

df.drop('data', axis=1)\
  .assign(**df.data.apply(pd.Series).add_prefix('col'))\
  .groupby('stimulus').mean()

score 0 · Accepted Answer

By using `reduce` and `operator.add`:

import numpy as np
import pandas as pd
import operator
from functools import reduce
df.groupby('stimulus').data.apply(lambda l : np.array(list(reduce(lambda x, y: map(operator.add, x,y), l)))/len(l))
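To make the mechanics clearer, here is a sketch of the same idea on plain Python lists outside of pandas. The variable names and values are illustrative. `reduce` folds the lists pairwise into an element-wise sum, and dividing by the number of lists gives the mean.

```python
import operator
from functools import reduce

import numpy as np

# Illustrative data: three equal-length lists, standing in for one stimulus group.
rows = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [5.0, 6.0, 7.0]]

# Fold the lists pairwise: each step adds two sequences element by element.
# In Python 3, map is lazy, so the sums are only evaluated by list() below.
summed = reduce(lambda x, y: map(operator.add, x, y), rows)

# Materialise the element-wise sum and divide by the number of lists.
mean = np.array(list(summed)) / len(rows)

print(mean)
```

This avoids building a 2D array, but for equal-length numeric lists the `np.stack` approach is usually simpler to read.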
