I have a long-format DataFrame df in the following form:

user_id day action1 action2 action3 action4 action5
      1   0       4       2       0       1       0
      1   1       4       2       0       1       0
      2   1       4       2       0       1       0

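The values in the action columns are the number of times the user took that action on that day. I want to convert this to a wide DataFrame while being able to extend the time range arbitrarily (say, to 365 days).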

I can easily reshape to wide:

df_indexed = df.set_index(['user_id', 'day'])
df_wide = df_indexed.unstack().fillna(0)

How would I add the remaining 358 days, filled with 0, for each of the five actions?

2 Answers

This is similar to what @ViktorKerkez suggested, but uses pandas.merge:

In [83]: df
Out[83]:
   user_id  day  action1  action2  action3  action4  action5
0        1    0        4        2        0        1        0
1        1    1        4        2        0        1        0
2        2    1        4        2        0        1        0

In [84]: import itertools
    ...: days_joiner = pd.DataFrame(list(itertools.product(df.user_id.unique(), range(365))),
    ...:                            columns=['user_id', 'day'])

In [85]: result = pd.merge(df, days_joiner, how='outer')

In [86]: result.head(10)
Out[86]:
   user_id  day  action1  action2  action3  action4  action5
0        1    0        4        2        0        1        0
1        1    1        4        2        0        1        0
2        2    1        4        2        0        1        0
3        1    2      NaN      NaN      NaN      NaN      NaN
4        1    3      NaN      NaN      NaN      NaN      NaN
5        1    4      NaN      NaN      NaN      NaN      NaN
6        1    5      NaN      NaN      NaN      NaN      NaN
7        1    6      NaN      NaN      NaN      NaN      NaN
8        1    7      NaN      NaN      NaN      NaN      NaN
9        1    8      NaN      NaN      NaN      NaN      NaN

In [87]: result.fillna(0).head(10)
Out[87]:
   user_id  day  action1  action2  action3  action4  action5
0        1    0        4        2        0        1        0
1        1    1        4        2        0        1        0
2        2    1        4        2        0        1        0
3        1    2        0        0        0        0        0
4        1    3        0        0        0        0        0
5        1    4        0        0        0        0        0
6        1    5        0        0        0        0        0
7        1    6        0        0        0        0        0
8        1    7        0        0        0        0        0
9        1    8        0        0        0        0        0
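
The session above can also be written as a standalone script. This is only a sketch: the sample data is copied from the question, while n_days, action_cols, the cast back to integers, and the final sort are additions of mine rather than part of the session.

```python
import itertools

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'user_id': [1, 1, 2],
    'day': [0, 1, 1],
    'action1': [4, 4, 4],
    'action2': [2, 2, 2],
    'action3': [0, 0, 0],
    'action4': [1, 1, 1],
    'action5': [0, 0, 0],
})

n_days = 365  # assumed horizon; any positive integer works

# One row per (user_id, day) pair.
days_joiner = pd.DataFrame(
    list(itertools.product(df['user_id'].unique(), range(n_days))),
    columns=['user_id', 'day'],
)

# The outer merge keeps every pair; pairs absent from df get NaN actions.
result = pd.merge(df, days_joiner, how='outer', on=['user_id', 'day'])

# NaN turns the action columns into floats, so fill with 0 and cast back.
action_cols = [c for c in df.columns if c.startswith('action')]
result = result.fillna(0).astype({c: 'int64' for c in action_cols})

# Do not rely on the row order a merge returns; sort explicitly.
result = result.sort_values(['user_id', 'day']).reset_index(drop=True)
```

After this, result has one row per user per day, so it can be passed to set_index(['user_id', 'day']).unstack() without leaving gaps.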

To be fair, here is a %timeit comparison of the two methods:

In [90]: timeit pd.merge(df, days_joiner, how='outer')
1000 loops, best of 3: 1.33 ms per loop

In [96]: timeit df_indexed.reindex(index, fill_value=0)
10000 loops, best of 3: 146 µs per loop

My answer is about 9 times slower!
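
For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using the standard timeit module. The helper names via_merge and via_reindex are made up for this example, and the numbers will depend on the machine and the pandas version, so no expected timings are shown.

```python
import itertools
import timeit

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'user_id': [1, 1, 2],
    'day': [0, 1, 1],
    'action1': [4, 4, 4],
    'action2': [2, 2, 2],
    'action3': [0, 0, 0],
    'action4': [1, 1, 1],
    'action5': [0, 0, 0],
})

# Shared setup, built once so that only the merge/reindex call is timed.
pairs = list(itertools.product(df['user_id'].unique(), range(365)))
days_joiner = pd.DataFrame(pairs, columns=['user_id', 'day'])
df_indexed = df.set_index(['user_id', 'day'])
index = pd.MultiIndex.from_tuples(pairs, names=['user_id', 'day'])


def via_merge():
    # Hypothetical helper: the merge-based approach.
    return pd.merge(df, days_joiner, how='outer')


def via_reindex():
    # Hypothetical helper: the reindex-based approach.
    return df_indexed.reindex(index, fill_value=0)


# Bring both results into the same shape so they can be compared.
merged = via_merge().fillna(0).set_index(['user_id', 'day']).sort_index()
reindexed = via_reindex().sort_index()

# Best of 3 runs of 100 calls each, reported per call.
n = 100
merge_time = min(timeit.repeat(via_merge, number=n, repeat=3)) / n
reindex_time = min(timeit.repeat(via_reindex, number=n, repeat=3)) / n
print(f'merge:   {merge_time * 1e6:.0f} µs per call')
print(f'reindex: {reindex_time * 1e6:.0f} µs per call')
```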

answered 2013-08-15T23:02:02.487

You can use a MultiIndexed DataFrame: build a new index with itertools.product from all the users in the DataFrame and all the days you want, then reindex, filling the missing rows with 0.

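import itertools
import pandas as pd

users = df.user_id.unique()
df_indexed = df.set_index(['user_id', 'day'])
# Every (user_id, day) pair for days 0..364
index = pd.MultiIndex.from_tuples(list(itertools.product(users, range(365))),
                                  names=['user_id', 'day'])
# Pairs missing from df_indexed get 0 in every action column
reindexed = df_indexed.reindex(index, fill_value=0)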
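
To get all the way to the wide layout the question asks for, the reindexed frame can then be unstacked. The following is a sketch under a few assumptions: the sample data is copied from the question, n_days is a name chosen here, and pd.MultiIndex.from_product is swapped in for itertools.product as a shorter way to build the same pairs.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'user_id': [1, 1, 2],
    'day': [0, 1, 1],
    'action1': [4, 4, 4],
    'action2': [2, 2, 2],
    'action3': [0, 0, 0],
    'action4': [1, 1, 1],
    'action5': [0, 0, 0],
})

n_days = 365  # assumed horizon; change to extend or shorten the range

df_indexed = df.set_index(['user_id', 'day'])

# from_product builds the same (user_id, day) pairs as itertools.product.
full_index = pd.MultiIndex.from_product(
    [df['user_id'].unique(), range(n_days)],
    names=['user_id', 'day'],
)

# Long format with every day present; missing days are filled with 0.
df_long = df_indexed.reindex(full_index, fill_value=0)

# Move 'day' into the columns: one row per user,
# one column per (action, day) pair.
df_wide = df_long.unstack('day')
```

Because every (user_id, day) pair exists before unstacking, no NaN values appear and no fillna call is needed afterwards. A single cell can be read with df_wide.loc[1, ('action1', 0)].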
answered 2013-08-15T22:35:24.230