
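In pandas, I have a DataFrame made up of two groups, each containing several samples. Each group has an internal reference value, and I want to subtract that value from every sample value within the group.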

import io
import pandas as pd

s = """Group    sample    value
group1    ref1    18.1
group1    smp1    NaN
group1    smp2    20.3
group1    smp3    30.0
group2    ref2    16.1
group2    smp4    29.2
group2    smp5    19.9
group2    smp6    28.9
"""
df = pd.read_csv(io.StringIO(s), sep=r'\s+')  # raw string avoids an invalid-escape warning
df = df.set_index(['Group', 'sample'])
df

Out[82]: 

                 value    
Group    sample
group1   ref1    18.1
         smp1    NaN
         smp2    20.3
         smp3    30.0
group2   ref2    16.1
         smp4    29.2
         smp5    19.9
         smp6    28.9

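What I want to do is add a new column in which the reference (ref) has been subtracted from every sample (smp) within the corresponding group. Like this: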

                 value  deltaValue
Group    sample
group1   ref1    18.1   0.0
         smp1    NaN    NaN
         smp2    20.3   2.2
         smp3    30.0   11.9
group2   ref2    16.1   0.0
         smp4    29.2   13.1
         smp5    19.9   3.8
         smp6    28.9   12.8

Does anyone know how to do this? Thanks!


2 Answers


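Group the DataFrame by the Group column. Then iterate over each group and look up the value of its ref sample. Then compute the difference for the whole value column.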

> df = pd.read_csv(io.StringIO(s), sep=r'\s+')
> df['diff'] = 0.0
> for name, group in df.groupby('Group'):
      # the reference row is named 'ref' + group number, e.g. 'ref1' for 'group1'
      ref = group.loc[group['sample'] == 'ref' + name.split('group')[1], 'value'].values[0]
      # value minus reference, as asked in the question; .loc avoids chained assignment
      df.loc[group.index, 'diff'] = group['value'] - ref
> print(df)
    Group sample  value  diff
0  group1   ref1   18.1   0.0
1  group1   smp1    NaN   NaN
2  group1   smp2   20.3   2.2
3  group1   smp3   30.0  11.9
4  group2   ref2   16.1   0.0
5  group2   smp4   29.2  13.1
6  group2   smp5   19.9   3.8
7  group2   smp6   28.9  12.8
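
The loop above depends on the reference row being named 'ref' plus the group number. As a rough, self-contained sketch of the same idea, the version below finds the reference by prefix instead and fails loudly if a group does not have exactly one reference row. The function name add_group_diff and the ref_prefix parameter are made up for illustration, and it computes value minus reference, the sign the question asks for.

```python
import io

import pandas as pd

DATA = """Group    sample    value
group1    ref1    18.1
group1    smp1    NaN
group1    smp2    20.3
group1    smp3    30.0
group2    ref2    16.1
group2    smp4    29.2
group2    smp5    19.9
group2    smp6    28.9
"""


def add_group_diff(df, ref_prefix="ref"):
    """Return a copy of df with a 'diff' column: value minus the group's reference."""
    out = df.copy()
    out["diff"] = float("nan")
    for name, group in out.groupby("Group"):
        # Rows whose sample name starts with the (assumed) reference prefix.
        ref_rows = group[group["sample"].str.startswith(ref_prefix)]
        if len(ref_rows) != 1:
            raise ValueError(f"expected exactly one reference row in {name!r}")
        ref_value = ref_rows["value"].iloc[0]
        # Assign by index label so only this group's rows are written.
        out.loc[group.index, "diff"] = group["value"] - ref_value
    return out


df = pd.read_csv(io.StringIO(DATA), sep=r"\s+")
result = add_group_diff(df)
print(result)
```

Returning a copy leaves the input frame untouched, which makes the function safe to call more than once on the same data.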
answered 2015-05-15T12:16:54.217

Here is a way to do it without a loop.

First create a function func that identifies the sample whose name starts with ref and then computes the delta values.

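In [33]: def func(grp):
    grp = grp.copy()  # avoid modifying the group in place
    # value of the row whose sample name starts with 'ref' (.ix was removed from pandas)
    ref = grp.loc[grp['sample'].str.startswith('ref'), 'value']
    grp['delta'] = grp['value'] - ref.values[0]
    return grp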

Then apply func to dff.groupby('Group'):

In [34]: dff.groupby('Group', group_keys=False).apply(func)
Out[34]:
    Group sample  value  delta
0  group1   ref1   18.1    0.0
1  group1   smp1    NaN    NaN
2  group1   smp2   20.3    2.2
3  group1   smp3   30.0   11.9
4  group2   ref2   16.1    0.0
5  group2   smp4   29.2   13.1
6  group2   smp5   19.9    3.8
7  group2   smp6   28.9   12.8
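
As a possible alternative to apply, the per-group reference can also be looked up with Series.map, which avoids calling a Python function once per group. This is only a sketch under the same assumption as above: each group has exactly one row whose sample name starts with 'ref'. The names is_ref and ref_by_group are illustrative.

```python
import io

import pandas as pd

s = """Group    sample    value
group1    ref1    18.1
group1    smp1    NaN
group1    smp2    20.3
group1    smp3    30.0
group2    ref2    16.1
group2    smp4    29.2
group2    smp5    19.9
group2    smp6    28.9
"""

dff = pd.read_csv(io.StringIO(s), sep=r"\s+")

# Boolean mask marking the reference row of each group.
is_ref = dff["sample"].str.startswith("ref")

# Series indexed by group name, holding that group's reference value.
ref_by_group = dff.loc[is_ref].set_index("Group")["value"]

# Map every row's group name to its reference value and subtract.
dff["delta"] = dff["value"] - dff["Group"].map(ref_by_group)
print(dff)
```

If a group had two reference rows, ref_by_group would have a duplicated index and the map call would raise, so the one-reference-per-group assumption is enforced implicitly.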

Note that dff must first look like this; it can be created with dff = df.reset_index():

In [35]: dff
Out[35]:
    Group sample  value
0  group1   ref1   18.1
1  group1   smp1    NaN
2  group1   smp2   20.3
3  group1   smp3   30.0
4  group2   ref2   16.1
5  group2   smp4   29.2
6  group2   smp5   19.9
7  group2   smp6   28.9
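
If you would rather keep the (Group, sample) MultiIndex from the question instead of calling reset_index, something along these lines should also work. It is a sketch, not a tested recipe for every pandas version, and it again assumes one 'ref' row per group. The column name deltaValue follows the question; the other variable names are illustrative.

```python
import io

import pandas as pd

s = """Group    sample    value
group1    ref1    18.1
group1    smp1    NaN
group1    smp2    20.3
group1    smp3    30.0
group2    ref2    16.1
group2    smp4    29.2
group2    smp5    19.9
group2    smp6    28.9
"""

df = pd.read_csv(io.StringIO(s), sep=r"\s+").set_index(["Group", "sample"])

groups = df.index.get_level_values("Group")
samples = df.index.get_level_values("sample")

# Boolean array marking the reference rows, taken from the 'sample' index level.
is_ref = samples.to_series().str.startswith("ref").to_numpy()

# Reference values indexed by group name only.
ref_by_group = df.loc[is_ref, "value"].droplevel("sample")

# Look up each row's group reference and subtract it positionally.
df["deltaValue"] = df["value"] - ref_by_group.reindex(groups).to_numpy()
print(df)
```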
answered 2015-05-15T12:24:21.220