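I'm trying to compute the difference between a column and its lagged (shifted) values within each group of a Pandas DataFrame. Each group needs to be sorted so that the shift refers to the previous time step. The standard way to do this is to .groupby() (the split), .apply() a distance function to each group, and then rejoin with pd.concat(). This works fine, but only if I don't explicitly sort the grouped DataFrame. When I do sort it, I get an error in the rejoin step.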

Here is example code that reproduces the unexpected behavior:

import pandas as pd

def dist_apply(group):

    # when commented out, this code will run to completion (!)
    group.sort_values(by='T',inplace=True)

    group['shift'] = group['Y'].shift()
    group['dist'] = group['Y'] - group['shift']
    return group

df = pd.DataFrame({'X': ['A', 'B', 'A', 'B', 'A', 'B'], 'T': [0.9, 0.8, 0.7, 0.9, 0.8, 0.7], 'Y': [7, 1, 8, 3, 9, 5]})
print(df)

# split
df_g = df.groupby(['X'])
# apply
df_g = df_g.apply(dist_apply)
print(df_g)

# rejoin
df = pd.concat([df,df_g],axis=1)
print(df)

With the line that sorts the group commented out, the code prints the following, which is what I expect:

   X    T  Y
0  A  0.9  7
1  B  0.8  1
2  A  0.7  8
3  B  0.9  3
4  A  0.8  9
5  B  0.7  5

   X    T  Y  shift  dist
0  A  0.9  7    NaN   NaN
1  B  0.8  1    NaN   NaN
2  A  0.7  8    7.0   1.0
3  B  0.9  3    1.0   2.0
4  A  0.8  9    8.0   1.0
5  B  0.7  5    3.0   2.0

   X    T  Y  X    T  Y  shift  dist
0  A  0.9  7  A  0.9  7    NaN   NaN
1  B  0.8  1  B  0.8  1    NaN   NaN
2  A  0.7  8  A  0.7  8    7.0   1.0
3  B  0.9  3  B  0.9  3    1.0   2.0
4  A  0.8  9  A  0.8  9    8.0   1.0
5  B  0.7  5  B  0.7  5    3.0   2.0

With the sorting line enabled, the traceback looks like this:

Traceback (most recent call last):
  File "test.py", line 19, in <module>
    df = pd.concat([df,df_g],axis=1)
  File "/Users/me/miniconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 229, in concat
    return op.get_result()
  File "/Users/me/miniconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 420, in get_result
    indexers[ax] = obj_labels.reindex(new_labels)[1]
  File "/Users/me/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py", line 2236, in reindex
    target = MultiIndex.from_tuples(target)
  File "/Users/me/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py", line 396, in from_tuples
    arrays = list(lib.tuples_to_object_array(tuples).T)
  File "pandas/_libs/lib.pyx", line 2287, in pandas._libs.lib.tuples_to_object_array
TypeError: object of type 'int' has no len()

Sorting but skipping the concat prints this for df_g:

     X    T  Y  shift  dist
X                          
A 2  A  0.7  8    NaN   NaN
  4  A  0.8  9    8.0   1.0
  0  A  0.9  7    9.0  -2.0
B 5  B  0.7  5    NaN   NaN
  1  B  0.8  1    5.0  -4.0
  3  B  0.9  3    1.0   2.0

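This shows that df_g is indexed differently from the unsorted df_g printed above (it now has a two-level index made of X and the original row label), but it's not clear to me why concat breaks in this case.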


Update: I thought I had solved it by renaming the offending column ('X' in this case) and calling .reset_index() on the grouped DataFrame before combining.

df_g.columns = ['X_g','T','Y','shift','dist']
df = pd.concat([df,df_g.reset_index()],axis=1)

This runs without error and prints:

   X    T  Y  X  level_1 X_g    T  Y  shift  dist
0  A  0.9  7  A        2   A  0.7  8    NaN   NaN
1  B  0.8  1  A        4   A  0.8  9    8.0   1.0
2  A  0.7  8  A        0   A  0.9  7    9.0  -2.0
3  B  0.9  3  B        5   B  0.7  5    NaN   NaN
4  A  0.8  9  B        1   B  0.8  1    5.0  -4.0
5  B  0.7  5  B        3   B  0.9  3    1.0   2.0

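But on closer inspection the rows are paired incorrectly. This row, for example, combines original row 1 (group B) with a sorted row from group A: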

    1  B  0.8  1  A        4   A  0.8  9    8.0   1.0

I'm on Mac OS X with Python 3.7.6 | packaged by conda-forge | (default, Jan 7 2020, 22:05:27).

Pandas 0.24.2 with NumPy 1.17.3. I also tried upgrading to Pandas 0.25.3 and NumPy 1.17.5, with the same result.

1 Answer

Here is a workaround for now.

Rename the columns to avoid duplicates:

df_g.columns = ['X_g','T','Y','shift','dist']

Reset the index from a MultiIndex to a flat one:

df_g = df_g.reset_index(level=[0,1])

Then do a plain merge, putting df_g first if you want to keep the sorted group order:

df = pd.merge(df_g,df)

This gives me:

   X  level_1 X_g    T  Y  shift  dist
0  A        2   A  0.7  8    NaN   NaN
1  A        4   A  0.8  9    8.0   1.0
2  A        0   A  0.9  7    9.0  -2.0
3  B        5   B  0.7  5    NaN   NaN
4  B        1   B  0.8  1    5.0  -4.0
5  B        3   B  0.9  3    1.0   2.0
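
Note that pd.merge with no `on` argument joins on every column name the two frames share (here X, T and Y), so it depends on those values identifying each row uniquely. A possible alternative, sketched below under the assumption that the sorted df_g has a two-level index of (group key, original row label) as in the printout in the question, is to drop only the group-key level and align on the original row labels. The grouped frame is rebuilt with a plain loop and pd.concat instead of apply(), since apply()'s index handling differs between pandas versions. The names `parts`, `new_cols` and `result` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'X': ['A', 'B', 'A', 'B', 'A', 'B'],
    'T': [0.9, 0.8, 0.7, 0.9, 0.8, 0.7],
    'Y': [7, 1, 8, 3, 9, 5],
})

# Build a frame shaped like the sorted df_g without relying on apply():
# outer index level = group key, inner level = original row label.
parts = {}
for key, group in df.groupby('X'):
    ordered = group.sort_values(by='T')
    shifted = ordered['Y'].shift()
    parts[key] = ordered.assign(shift=shifted, dist=ordered['Y'] - shifted)
df_g = pd.concat(parts)

# Drop the group-key level so only the original row labels remain,
# then join the new columns back onto df by index.
new_cols = df_g.droplevel(0)[['shift', 'dist']]
result = df.join(new_cols)
print(result)
```

Because the join is by row label, `result` keeps the original row order of df. Sort it by X and T afterwards if the grouped order is wanted.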

Full code:

import pandas as pd

def dist_apply(group):

    group.sort_values(by='T',inplace=True)

    group['shift'] = group['Y'].shift()
    group['dist'] = group['Y'] - group['shift']
    return group

df = pd.DataFrame({'X': ['A', 'B', 'A', 'B', 'A', 'B'], 'T': [0.9, 0.8, 0.7, 0.9, 0.8, 0.7], 'Y': [7, 1, 8, 3, 9, 5]})
print(df)
df_g = df.groupby(['X'])

df_g = df_g.apply(dist_apply)

#print(df_g)

df_g.columns = ['X_g','T','Y','shift','dist']
df_g = df_g.reset_index(level=[0,1])

#print(df_g)
df = pd.merge(df_g,df)

print(df)
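
If the goal is only the per-group lag and difference, it may be possible to skip apply() and the rejoin step entirely. The sketch below is an alternative to the workaround above, not the same method: it sorts a copy of the frame once, takes a grouped shift, and lets pandas align the result back to df by row label on assignment. The variable name `ordered` is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'X': ['A', 'B', 'A', 'B', 'A', 'B'],
    'T': [0.9, 0.8, 0.7, 0.9, 0.8, 0.7],
    'Y': [7, 1, 8, 3, 9, 5],
})

# Sort a copy by group and time; the original row labels are preserved.
ordered = df.sort_values(['X', 'T'])

# Shift Y within each group. The result keeps the original row labels,
# so assigning it to df aligns each value with the row it belongs to.
df['shift'] = ordered.groupby('X')['Y'].shift()
df['dist'] = df['Y'] - df['shift']
print(df)
```

With the sample data this yields the same shift and dist values per row as the sorted df_g in the question, while df keeps its original flat index and row order.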
answered 2020-01-29T12:49:49.813