python - Using pandas, how do I subsample a large DataFrame by group in an efficient manner?

Question

I am trying to subsample the rows of a DataFrame according to a grouping. Here is an example. Say I define the following data:

from pandas import *
df = DataFrame({'group1' : ["a","b","a","a","b","c","c","c","c",
                            "c","a","a","a","b","b","b","b"],
                'group2' : [1,2,3,4,1,3,5,6,5,4,1,2,3,4,3,2,1],
                'value'  : ["apple","pear","orange","apple",
                            "banana","durian","lemon","lime",
                            "raspberry","durian","peach","nectarine",
                            "banana","lemon","guava","blackberry","grape"]})

If I group by group1 and group2, the number of rows in each group is:

In [190]: df.groupby(['group1','group2'])['value'].agg({'count':len})
Out[190]: 
      count
a  1  2    
   2  1    
   3  2    
   4  1    
b  1  2    
   2  2    
   3  1    
   4  1    
c  3  1    
   4  1    
   5  2    
   6  1

(If there is a more concise way to compute this, please let me know.)
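As an aside on the per-group counts: one more compact option is GroupBy.size(). The following is a minimal sketch, assuming a reasonably recent pandas; it rebuilds only the two grouping columns from the example, and the name `counts` is introduced here for illustration.

```python
import pandas as pd

# Only the two grouping columns from the example above are needed for counting
df = pd.DataFrame({
    "group1": ["a", "b", "a", "a", "b", "c", "c", "c", "c",
               "c", "a", "a", "a", "b", "b", "b", "b"],
    "group2": [1, 2, 3, 4, 1, 3, 5, 6, 5, 4, 1, 2, 3, 4, 3, 2, 1],
})

# size() returns a Series indexed by (group1, group2) with each group's row count
counts = df.groupby(["group1", "group2"]).size()
print(counts)
```

Unlike agg with len, size() does not need a column to be selected first.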

I now want to construct a DataFrame that contains one randomly selected row from each group. My proposal is to do it like this:

In [215]: from random import choice
In [216]: grouped = df.groupby(['group1','group2'])
In [217]: subsampled = grouped.apply(lambda x: df.reindex(index=[choice(range(len(x)))]))
In [218]: subsampled.index = range(len(subsampled))
In [219]: subsampled
Out[219]: 
    group1  group2  value
0   b       2       pear 
1   a       1       apple
2   b       2       pear 
3   a       1       apple
4   a       1       apple
5   a       1       apple
6   a       1       apple
7   a       1       apple
8   a       1       apple
9   a       1       apple
10  a       1       apple
11  a       1       apple

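This works. However, my real data has about 2.5 million rows and 12 columns. If I do this by building my own data structures, I can complete the operation in a matter of seconds. My implementation above, however, does not finish within 30 minutes (and does not appear to be memory-limited). As a side note, when I tried implementing this in R, I first tried plyr, which also did not finish in a reasonable amount of time; a solution using data.table, however, finished very quickly.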

How do I get this to run quickly with pandas? I want to love this package, so please help!

score 8 · Accepted Answer

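I tested with apply, and it seems to be very slow when there are many subgroups. The groups attribute of grouped is a dict, so you can choose an index from it directly: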

subsampled = df.ix[(choice(x) for x in grouped.groups.itervalues())]

Edit: as of pandas version 0.18.1, itervalues no longer works on groupby objects; you can use .values instead:

subsampled = df.ix[(choice(x) for x in grouped.groups.values())]
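For readers on current pandas, where .ix is no longer available (it was removed in pandas 1.0), the same idea can be sketched with .loc. This is a hedged adaptation rather than the answer's original code: the names `picked` and `subsampled_alt` are introduced here for illustration, and the second variant assumes pandas 1.1 or later, where GroupBy.sample exists.

```python
import random

import pandas as pd

df = pd.DataFrame({
    "group1": ["a", "b", "a", "a", "b", "c", "c", "c", "c",
               "c", "a", "a", "a", "b", "b", "b", "b"],
    "group2": [1, 2, 3, 4, 1, 3, 5, 6, 5, 4, 1, 2, 3, 4, 3, 2, 1],
    "value": ["apple", "pear", "orange", "apple",
              "banana", "durian", "lemon", "lime",
              "raspberry", "durian", "peach", "nectarine",
              "banana", "lemon", "guava", "blackberry", "grape"],
})

grouped = df.groupby(["group1", "group2"])

# grouped.groups maps each group key to the index labels of that group's rows;
# choose one label per group, then select those rows by label with .loc
picked = [random.choice(list(labels)) for labels in grouped.groups.values()]
subsampled = df.loc[picked]

# Alternative (assumes pandas >= 1.1): let pandas draw one row from every group
subsampled_alt = grouped.sample(n=1, random_state=0)

print(subsampled)
print(subsampled_alt)
```

Both variants return one row per (group1, group2) pair while keeping the original index labels; call reset_index(drop=True) if a fresh 0..n-1 index is wanted.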
