Suppose I run
grouped_id_date = ddf.groupby(['my_id', 'my_date']).count().compute()
and get back a new DataFrame that counts the rows for each existing (my_id, my_date) pair:
+------------+------------+-------------------+
| my_id      | my_date    | my_value (random) |
+------------+------------+-------------------+
| MultiIndex | MultiIndex | Normal Column     |
| A          | 2020-06-03 | 5                 |
| A          | 2020-06-04 | 3                 |
| B          | 2020-06-03 | 3                 |
| C          | 2020-06-04 | 4                 |
+------------+------------+-------------------+
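Now I want to go back to ddf and use .loc to select only the rows whose (my_id, my_date) pair has a count greater than 3 (my_count > 3). What is a good way to achieve this?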
My current solution is below. It works, but it feels clumsy, and there has to be a better way:
condition = None
for i, (my_id, my_date) in enumerate(grouped_id_date.index):
    if i == 1000:
        break  # not sure where the maximum-recursion error kicks in
    # Match the rows of ddf that belong to this (my_id, my_date) pair.
    pair = (ddf.my_id == my_id) & (ddf.my_date == my_date)
    if condition is None:
        condition = pair
    else:
        condition = condition | pair
result = ddf.loc[condition]  # Works, but is slow and eventually hits the recursion limit.
The DataFrame has 500,000,000 rows, so the solution should not involve too much shuffling and the like.
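One possible direction, offered only as a sketch: instead of OR-ing one condition per pair, collect the qualifying (my_id, my_date) pairs into a small pandas DataFrame and inner-join it back onto the large frame. The helper name keep_frequent_groups and the toy data are hypothetical, and the demo runs on plain pandas so it can be tried without a cluster. The assumption is that the same merge call can be used on a dask DataFrame with the small pandas frame on the right-hand side, which should avoid shuffling the large frame; that part is not verified here against 500,000,000 rows.

```python
import pandas as pd


def keep_frequent_groups(df, keys, min_count):
    """Keep rows of df whose key combination occurs more than min_count times.

    Hypothetical helper. Demonstrated on pandas; for a dask DataFrame the
    only extra step is calling .compute() on the (small) group counts.
    """
    counts = df.groupby(keys).size()
    if hasattr(counts, "compute"):  # dask: materialise the small counts
        counts = counts.compute()
    # Small pandas frame holding only the qualifying key combinations.
    wanted = counts[counts > min_count].reset_index()[keys]
    # The inner join keeps only rows whose keys appear in `wanted`.
    return df.merge(wanted, on=keys, how="inner")


# Toy data whose group sizes match the counts in the question's table.
rows = (
    [("A", "2020-06-03")] * 5
    + [("A", "2020-06-04")] * 3
    + [("B", "2020-06-03")] * 3
    + [("C", "2020-06-04")] * 4
)
demo = pd.DataFrame(rows, columns=["my_id", "my_date"])
demo["my_value"] = range(len(demo))

result = keep_frequent_groups(demo, ["my_id", "my_date"], 3)
```

With the toy data, only the rows for (A, 2020-06-03) and (C, 2020-06-04) remain, since those are the only pairs with more than 3 rows. Because the filter is a single join rather than a chain of thousands of OR-ed comparisons, the expression graph stays small, which is the part that triggered the recursion limit in the loop-based version.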