I have the following DataFrame:

In[45]: data[:10]  
Out[45]:
   Z    A    beta2    M      shell
0  100  200  0.3112   197.2 -4.213
1  100  200 -0.4197   202   -1.143
2  100  200  0.03205  203    0    
3  100  201  0.2967   191   -4.434
4  100  201 -0.4893   196.1 -4.691
5  100  202  0.3084   183.4 -4.134
6  100  202 -0.4873   188.2 -4.75 
7  100  202 -0.2483   188.4 -1.106
8  100  203  0.3069   177.1 -4.355
9  101  203 -0.4956   182.5 -5.217

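My question is: given that the (Z, A) pairs are not unique, how can I group/transform the data so that (Z, A) becomes the index (a MultiIndex)? To make my goal explicit, this is what I expect to get: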

             beta2[1]  beta2[2]  beta2[3]  M[1]   M[2]   M[3]   shell[1]  shell[2]  shell[3]
   Z    A
0  100  200  0.3112    -0.4197   0.03205   197.2  202    203    -4.213    -1.143    0
1  100  201  0.2967    -0.4893   NaN       191    196.1  NaN    -4.434    -4.691    NaN
2  100  202  0.3084    -0.4873   -0.2483   183.4  188.2  188.4  -4.134    -4.75     -1.106
3  100  203  0.3069    NaN       NaN       177.1  NaN    NaN    -4.355    NaN       NaN
4  101  203  -0.4956   NaN       NaN       182.5  NaN    NaN    -5.217    NaN       NaN

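I know this involves at least two steps, one to deal with the non-uniqueness and one to index on Z, A, so help with either step is appreciated. Also, is there a data structure that might suit this problem better?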

Edit: I found that the line:

data = data.set_index(['Z', 'A'])

solves the indexing on Z, A. Unfortunately, this only works when the (Z, A) pairs are unique.
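
A quick way to see why the duplicates matter is to build the index and ask pandas whether it is unique. The sketch below rebuilds the ten sample rows under the assumed name `data` and passes the keys as a list, which is the form current pandas versions expect; the variable names `indexed`, `is_unique` and `rows_per_pair` are only illustrative.

```python
import pandas as pd

# The ten sample rows from the question, rebuilt for illustration.
data = pd.DataFrame({
    "Z": [100] * 9 + [101],
    "A": [200, 200, 200, 201, 201, 202, 202, 202, 203, 203],
    "beta2": [0.3112, -0.4197, 0.03205, 0.2967, -0.4893,
              0.3084, -0.4873, -0.2483, 0.3069, -0.4956],
    "M": [197.2, 202.0, 203.0, 191.0, 196.1,
          183.4, 188.2, 188.4, 177.1, 182.5],
    "shell": [-4.213, -1.143, 0.0, -4.434, -4.691,
              -4.134, -4.75, -1.106, -4.355, -5.217],
})

indexed = data.set_index(["Z", "A"])

# Several (Z, A) pairs occur more than once, so the index is not unique.
is_unique = indexed.index.is_unique

# Number of rows per (Z, A) pair, in order of first appearance.
rows_per_pair = data.groupby(["Z", "A"], sort=False).size()

print(is_unique)
print(rows_per_pair)
```

The largest count in `rows_per_pair` is the number of `[1]`, `[2]`, ... columns each variable would need in the wide layout.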

1 Answer

I have an open issue about addressing this kind of problem:

https://github.com/pydata/pandas/issues/388

Here is a solution. First, a simple (but not very efficient) function to get each row's ordinal position within its group:

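from collections import defaultdict

import numpy as np


def group_position(*args):
    """
    Return each row's 0-based position within its group of equal keys.
    """
    # How many times each key tuple has been seen so far
    table = defaultdict(int)

    result = []
    for tup in zip(*args):
        result.append(table[tup])
        table[tup] += 1

    return np.array(result)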

For example:

In [49]: group_position(df['Z'], df['A'])
Out[49]: array([0, 1, 2, 0, 1, 0, 1, 2, 0, 0])
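
For reference, newer pandas versions expose the same within-group counter through `groupby(...).cumcount()`, which avoids the Python-level loop. A minimal sketch, assuming a frame with the same `Z` and `A` columns as above (the other columns are left out for brevity):

```python
import pandas as pd

# Only the key columns of the sample data are needed here.
df = pd.DataFrame({
    "Z": [100] * 9 + [101],
    "A": [200, 200, 200, 201, 201, 202, 202, 202, 203, 203],
})

# cumcount numbers the rows of each (Z, A) group 0, 1, 2, ... in row order.
pos = df.groupby(["Z", "A"]).cumcount()

print(pos.to_numpy())
```

The result is a Series aligned with `df`, so it can be assigned directly as a new column.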

Now use this as an auxiliary index variable and unstack:

In [52]: df
Out[52]: 
     Z    A    beta2      M  shell
0  100  200  0.31120  197.2 -4.213
1  100  200 -0.41970  202.0 -1.143
2  100  200  0.03205  203.0  0.000
3  100  201  0.29670  191.0 -4.434
4  100  201 -0.48930  196.1 -4.691
5  100  202  0.30840  183.4 -4.134
6  100  202 -0.48730  188.2 -4.750
7  100  202 -0.24830  188.4 -1.106
8  100  203  0.30690  177.1 -4.355
9  101  203 -0.49560  182.5 -5.217

In [53]: df['pos'] = group_position(df['Z'], df['A'])

In [54]: df.set_index(['Z', 'A', 'pos']).unstack('pos')
Out[54]: 
          beta2                       M                shell              
pos           0       1        2      0      1      2      0      1      2
Z   A                                                                     
100 200  0.3112 -0.4197  0.03205  197.2  202.0  203.0 -4.213 -1.143  0.000
    201  0.2967 -0.4893      NaN  191.0  196.1    NaN -4.434 -4.691    NaN
    202  0.3084 -0.4873 -0.24830  183.4  188.2  188.4 -4.134 -4.750 -1.106
    203  0.3069     NaN      NaN  177.1    NaN    NaN -4.355    NaN    NaN
101 203 -0.4956     NaN      NaN  182.5    NaN    NaN -5.217    NaN    NaN

Finally, to get it in exactly the form you showed:

In [61]: result = df.set_index(['Z', 'A', 'pos']).unstack('pos')

In [62]: result.rename(columns=lambda x: '%s[%d]' % (x[0], x[1]+1)).reset_index()
Out[62]: 
     Z    A  beta2[1]  beta2[2]  beta2[3]   M[1]   M[2]   M[3]  shell[1]  shell[2]  shell[3]
0  100  200    0.3112   -0.4197   0.03205  197.2  202.0  203.0    -4.213    -1.143     0.000
1  100  201    0.2967   -0.4893       NaN  191.0  196.1    NaN    -4.434    -4.691       NaN
2  100  202    0.3084   -0.4873  -0.24830  183.4  188.2  188.4    -4.134    -4.750    -1.106
3  100  203    0.3069       NaN       NaN  177.1    NaN    NaN    -4.355       NaN       NaN
4  101  203   -0.4956       NaN       NaN  182.5    NaN    NaN    -5.217       NaN       NaN
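
In recent pandas versions, `rename` with a function is applied to the labels of a MultiIndex one level at a time rather than to whole `(name, pos)` tuples, so the lambda above may not behave the same way there. A hedged sketch of the same pipeline that flattens the column labels by assigning them directly; the name `wide` is only illustrative, and `cumcount()` stands in for `group_position`:

```python
import pandas as pd

# The ten sample rows, rebuilt for illustration.
df = pd.DataFrame({
    "Z": [100] * 9 + [101],
    "A": [200, 200, 200, 201, 201, 202, 202, 202, 203, 203],
    "beta2": [0.3112, -0.4197, 0.03205, 0.2967, -0.4893,
              0.3084, -0.4873, -0.2483, 0.3069, -0.4956],
    "M": [197.2, 202.0, 203.0, 191.0, 196.1,
          183.4, 188.2, 188.4, 177.1, 182.5],
    "shell": [-4.213, -1.143, 0.0, -4.434, -4.691,
              -4.134, -4.75, -1.106, -4.355, -5.217],
})

# Position of each row within its (Z, A) group, then pivot it into columns.
df["pos"] = df.groupby(["Z", "A"]).cumcount()
wide = df.set_index(["Z", "A", "pos"]).unstack("pos")

# Each column label is a (name, pos) tuple; join it into 'name[pos+1]'.
wide.columns = [f"{name}[{pos + 1}]" for name, pos in wide.columns]
wide = wide.reset_index()

print(wide)
```

Groups with fewer rows than the largest group are padded with NaN, as in the output above.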
Answered 2012-04-13T22:40:25.460