I am trying to subsample rows of a DataFrame based on a grouping. Here is an example. Suppose I define the following data:
from pandas import *
df = DataFrame({'group1' : ["a","b","a","a","b","c","c","c","c",
"c","a","a","a","b","b","b","b"],
'group2' : [1,2,3,4,1,3,5,6,5,4,1,2,3,4,3,2,1],
'value' : ["apple","pear","orange","apple",
"banana","durian","lemon","lime",
"raspberry","durian","peach","nectarine",
"banana","lemon","guava","blackberry","grape"]})
If I group by group1
and group2
, then the number of rows in each group is:
In [190]: df.groupby(['group1','group2'])['value'].agg({'count':len})
Out[190]:
count
a 1 2
2 1
3 2
4 1
b 1 2
2 2
3 1
4 1
c 3 1
4 1
5 2
6 1
(If there is a more concise way of computing this, please tell me.)
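As a side answer to that parenthetical: a more concise way to get the per-group row counts (a small sketch reusing the same example frame; the variable name `counts` is just illustrative) is `GroupBy.size()`, which avoids the dict-style `agg` entirely:

```python
import pandas as pd

df = pd.DataFrame({
    'group1': ["a", "b", "a", "a", "b", "c", "c", "c", "c",
               "c", "a", "a", "a", "b", "b", "b", "b"],
    'group2': [1, 2, 3, 4, 1, 3, 5, 6, 5, 4, 1, 2, 3, 4, 3, 2, 1],
    'value':  ["apple", "pear", "orange", "apple",
               "banana", "durian", "lemon", "lime",
               "raspberry", "durian", "peach", "nectarine",
               "banana", "lemon", "guava", "blackberry", "grape"],
})

# size() counts the rows in each group directly; no agg dict needed
counts = df.groupby(['group1', 'group2']).size()
print(counts)
```

This returns a Series with a (group1, group2) MultiIndex, one count per group.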
I now want to construct a DataFrame that randomly selects one row from each group. My suggestion was to do it like this:
In [215]: from random import choice
In [216]: grouped = df.groupby(['group1','group2'])
In [217]: subsampled = grouped.apply(lambda x: x.iloc[[choice(range(len(x)))]])
In [218]: subsampled.index = range(len(subsampled))
In [219]: subsampled
Out[219]:
   group1  group2      value
0       a       1      peach
1       a       2  nectarine
2       a       3     orange
3       a       4      apple
4       b       1      grape
5       b       2       pear
6       b       3      guava
7       b       4      lemon
8       c       3     durian
9       c       4     durian
10      c       5  raspberry
11      c       6       lime
Which works. However, my real data has roughly 2.5 million rows and 12 columns. If I do this by building my own data structures, I can finish the task in a matter of seconds; my implementation above, however, does not finish within 30 minutes (and does not appear to be memory-limited). As a side note, when I tried implementing this in R, I first tried plyr
, which also did not finish in a reasonable amount of time; a solution using data.table
, however, finished very quickly.
How do I get this to work fast with pandas
? I want to love this package, so please help!
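One pattern that sidesteps the per-group Python-level apply entirely (a sketch of the idea, not necessarily the only fast route) is to shuffle the whole frame once and then keep the first row seen for each (group1, group2) pair with `drop_duplicates`. Both steps are vectorized, so this scales to millions of rows; since the shuffle is uniform, the first row per group is a uniformly random pick. Newer pandas (≥ 1.1) also has a built-in `df.groupby(...).sample(n=1)` that does the same thing directly. The sort/reset at the end is just cosmetic:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group1': ["a", "b", "a", "a", "b", "c", "c", "c", "c",
               "c", "a", "a", "a", "b", "b", "b", "b"],
    'group2': [1, 2, 3, 4, 1, 3, 5, 6, 5, 4, 1, 2, 3, 4, 3, 2, 1],
    'value':  ["apple", "pear", "orange", "apple",
               "banana", "durian", "lemon", "lime",
               "raspberry", "durian", "peach", "nectarine",
               "banana", "lemon", "guava", "blackberry", "grape"],
})

# Shuffle all rows once (O(n)), then keep the first occurrence of each
# (group1, group2) pair -- a uniformly random row per group.
shuffled = df.iloc[np.random.permutation(len(df))]
subsampled = shuffled.drop_duplicates(['group1', 'group2'])

# Cosmetic: restore group order and a flat 0..n-1 index
subsampled = subsampled.sort_values(['group1', 'group2']).reset_index(drop=True)
print(subsampled)
```

The key difference from the `apply` version is that no Python function is called once per group; the random work is a single permutation over the whole frame.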