pandas - Pandas/Dask: Filter a DataFrame by a MultiIndex or by two columns of a second DataFrame?

Question

Suppose we run

grouped_id_date = ddf.groupby(['my_id', 'my_date']).count().compute()

which gives us a new DataFrame that counts the rows for each (my_id, my_date) pair that exists:

+------------+------------+----+-------------------+
|   my_id    |  my_date   | || | my_value (random) |
+------------+------------+----+-------------------+
| MultiIndex | MultiIndex | || | Normal Column     |
| A          | 2020-06-03 | || | 5                 |
| A          | 2020-06-04 | || | 3                 |
| B          | 2020-06-03 | || | 3                 |
| C          | 2020-06-04 | || | 4                 |
+------------+------------+----+-------------------+

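Now I want to go back to ddf and use .loc to select only the rows whose (my_id, my_date) pair has a count greater than 3 (my_count > 3). What is a good way to achieve this?

My current solution looks like this. It works, but it feels like there has to be a better way: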

condition = None
# Iterate over the (my_id, my_date) tuples of the MultiIndex and OR one condition per pair.
for i, (my_id, my_date) in enumerate(grouped_id_date.index):
    if i == 1000:
        break  # not sure where the maximum-recursion exception kicks in
    pair_condition = (ddf.my_id == my_id) & (ddf.my_date == my_date)
    condition = pair_condition if condition is None else condition | pair_condition

result = ddf.loc[condition]  # Works, but it is slow and hits a maximum-recursion exception at some point.

The DataFrame has 500,000,000 rows, so there should not be too much shuffling and the like.

score 0 · Accepted Answer

Something like this should work:

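  # Keep only the groups whose count exceeds the threshold.
  grouped_id_date = grouped_id_date[grouped_id_date['my_value'] > 3]
  # A set makes each membership test cheap compared with scanning a list.
  valid_pairs = set(grouped_id_date.index.tolist())
  all_pairs = list(ddf[['my_id', 'my_date']].values)
  mask = [(my_id, my_date) in valid_pairs for (my_id, my_date) in all_pairs]

  result = ddf[mask]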

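The idea is to build your own boolean index. You know that every pair in the grouped data must also exist in the original DataFrame ddf. First extract the MultiIndex holding all valid pairs into a collection. Then extract every pair from ddf and check whether it is among the valid ones.

Disclaimer: I have not tested this code. The logic should be correct, but there may be hidden typos that cause a SyntaxError or something similar.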
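
For a plain pandas DataFrame, the same boolean-index idea can be sketched without a Python-level loop by using `MultiIndex.isin`. This is only a minimal sketch on toy data modeled on the table in the question; the frame `df` and the helper names `counts` and `row_pairs` are assumptions for illustration, and it has not been checked against a Dask DataFrame.

```python
import pandas as pd

# Toy data modeled on the question's table: the four pairs occur 5, 3, 3 and 4 times.
rows = (
    [("A", "2020-06-03")] * 5
    + [("A", "2020-06-04")] * 3
    + [("B", "2020-06-03")] * 3
    + [("C", "2020-06-04")] * 4
)
df = pd.DataFrame(rows, columns=["my_id", "my_date"])
df["my_date"] = pd.to_datetime(df["my_date"])
df["my_value"] = range(len(df))

# Count rows per (my_id, my_date) pair; the result is indexed by a MultiIndex.
counts = df.groupby(["my_id", "my_date"])["my_value"].count()

# Keep only the pairs that occur more than 3 times.
valid_pairs = counts[counts > 3].index

# Build a MultiIndex from the two key columns of every row and test membership.
row_pairs = pd.MultiIndex.from_frame(df[["my_id", "my_date"]])
mask = row_pairs.isin(valid_pairs)

result = df[mask]
print(result.groupby(["my_id", "my_date"]).size())
```

Here `mask` is a boolean array with one entry per row of `df`, so it plays the same role as the hand-built list in the answer.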

score 0 · Accepted Answer

Here is what I came up with:

QS_MIN_ROWS_PER_GROUP = 3

# Build groups for each my_date+my_id combination (take a look at what's in there)
grouped_myid_mydate = ddf_c.groupby(['my_id', 'my_date'])

# Count the number of occurrences on that day for that id.
quotes_per_myid_mydate_all = grouped_myid_mydate.count().compute()

# Apply the filter based on the groupby (this compares the rows per group with a pre-defined threshold).
# my_id and my_date are index levels after the groupby, so compare a counted value column instead.
qs_myid_mydate_combinations = quotes_per_myid_mydate_all.loc[quotes_per_myid_mydate_all.my_value > QS_MIN_ROWS_PER_GROUP]

# Get valid pairs from the MultiIndex
valid_pairs = qs_myid_mydate_combinations.index.tolist()

# Build a list that can be matched against a newly added search column,
# which contains the values of both columns to compare with. Nasty.

valid_pairs_formated = []
for pair in valid_pairs:
    valid_pairs_formated.append('%s;%s' % (pair[0], pair[1]))
print(valid_pairs_formated)

# Add the new search column to the central `DataFrame`. This assumes no ';' in the columns!
ddf_c['pair_code'] = (ddf_c.my_id + ';' + ddf_c.my_date.astype(str))

Then we can filter pair_code against valid_pairs_formated:

is_in_valid_set_of_combinations = ddf_c.pair_code.isin(valid_pairs_formated)

Let's check whether the result is plausible:

is_in_valid_set_of_combinations.value_counts().compute() # you can skip this

>> Output:
True     246641219
False        11377
Name: pair_code, dtype: int64

Yes, that looks fine.

# Lastly, reach the target: filter the original DataFrame with the boolean Series built above
ddf_c = ddf_c.loc[is_in_valid_set_of_combinations]

# And finally check the row count
len(ddf_c.index)
> 246641219

# And remove that nasty search column:
ddf_c = ddf_c.drop(columns=['pair_code'])

That is a lot of code for an "n"-column comparison... but it works.
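
The steps above can be condensed into a small end-to-end sketch on toy data. It uses plain pandas rather than Dask, and the toy frame and the names `valid_codes` and `filtered` are assumptions for illustration. `my_date` is kept as a string here so that the pair code is formatted identically on both sides of the comparison; with real datetime values the two string conversions would need to be checked for a matching format.

```python
import pandas as pd

QS_MIN_ROWS_PER_GROUP = 3

# Toy data modeled on the question's table: the four pairs occur 5, 3, 3 and 4 times.
rows = (
    [("A", "2020-06-03")] * 5
    + [("A", "2020-06-04")] * 3
    + [("B", "2020-06-03")] * 3
    + [("C", "2020-06-04")] * 4
)
df = pd.DataFrame(rows, columns=["my_id", "my_date"])
df["my_value"] = range(len(df))

# Count rows per pair and keep the pairs above the threshold.
counts = df.groupby(["my_id", "my_date"])["my_value"].count()
valid_pairs = counts[counts > QS_MIN_ROWS_PER_GROUP].index.tolist()

# Encode every valid pair as a single string; assumes ';' never occurs in the values.
valid_codes = ["%s;%s" % pair for pair in valid_pairs]

# Encode every row the same way and filter with a single-column isin.
pair_code = df["my_id"] + ";" + df["my_date"]
filtered = df[pair_code.isin(valid_codes)]

print(valid_codes)
print(len(filtered))
```

Keeping `pair_code` as a separate Series instead of a column avoids having to drop it again afterwards.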

score 0 · Accepted Answer

If you really know your data, you can also make some assumptions and build a numeric function that is faster to compute and to filter on:

We assume that my_id (which appears as the column s_id in the code below) is less than 100000, so we can build a new column pair_code_numeric:

PAIR_CODE_OFFSET_FOR_SID = 100000
col_name = 'pair_code_numeric'
ddf_c[col_name] = (
    (ddf_c.index.dt.year * (10000 * PAIR_CODE_OFFSET_FOR_SID))
    + (ddf_c.index.dt.month * (100 * PAIR_CODE_OFFSET_FOR_SID))
    + (ddf_c.index.dt.day * PAIR_CODE_OFFSET_FOR_SID)
    + ddf_c.s_id
)

The result looks like this:

# View the data without scientific (e+XX) notation; the lambda returns strings, hence the object meta
ddf_c[col_name].apply(lambda x: '%0.f' % x, meta=(col_name, 'object')).head()

2019-05-22 09:10:00.011433    2019052200210
2019-05-22 09:10:03.690125    2019052200175
2019-05-22 09:10:04.160046    2019052200448
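
To see how the encoding packs the date and the id into one integer, the arithmetic can be checked on a couple of rows. The sketch below uses plain pandas and a hypothetical two-row frame whose timestamps and ids are taken from the output above. With an offset of 100000, the id occupies the last five digits, preceded by two digits each for day and month, then the year. The explicit cast to int64 is a precaution, because the year term is larger than a 32-bit integer can hold.

```python
import pandas as pd

PAIR_CODE_OFFSET_FOR_SID = 100000

# Hypothetical frame: a DatetimeIndex plus an id column, as in the answer.
df = pd.DataFrame(
    {"s_id": [210, 175]},
    index=pd.to_datetime(
        ["2019-05-22 09:10:00.011433", "2019-05-22 09:10:03.690125"]
    ),
)

# Extract the date parts as 64-bit integers to avoid overflow.
year = df.index.year.to_numpy().astype("int64")
month = df.index.month.to_numpy().astype("int64")
day = df.index.day.to_numpy().astype("int64")

# YYYY MM DD IIIII packed into a single integer.
df["pair_code_numeric"] = (
    year * (10000 * PAIR_CODE_OFFSET_FOR_SID)
    + month * (100 * PAIR_CODE_OFFSET_FOR_SID)
    + day * PAIR_CODE_OFFSET_FOR_SID
    + df["s_id"].to_numpy()
)

print(df["pair_code_numeric"].tolist())
# The id can be recovered again with a modulo.
print((df["pair_code_numeric"] % PAIR_CODE_OFFSET_FOR_SID).tolist())
```

The encoding is only unambiguous while every id stays below the offset, which is exactly the assumption stated above.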

After that, the rest is a straightforward groupby and .loc on a single column.

The .groupby first:

v = True
grouped_pair_code = ddf_c.groupby([col_name])
# Count the number of rows for each pair code
# (one approach chosen here, but the method can be applied to anything).
quotes_per_pair_code_all = grouped_pair_code.count().compute()
if v: print('Got %s %s combintions before Q/S' % (quotes_per_pair_code_all.shape[0], col_name))

# Get the valid pair_code_numeric combinations from the groupby by counting the rows per group.
# The minimum is 100 rows per group (the threshold passed to qs_pair_code_combinations).
qs_pair_code_combis = qs_pair_code_combinations(
    quotes_per_pair_code_all=quotes_per_pair_code_all,
    QS_MIN_L1_ROWS_PER_DAY=100,
    v=False,
)
ddf_c = client.persist(ddf_c)

Output:

Got 3467 pair_code_numeric combintions before Q/S
Got 2646 valid pair_code_numeric_combis
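
The helper `qs_pair_code_combinations` is not shown in the answer. The following is a minimal sketch of what such a function might look like, assuming it only keeps the groups whose row count reaches the threshold and optionally reports how many remain. The function body, the use of the first column as the row count, the `>=` comparison, and the toy counts are all assumptions for illustration, not the author's code.

```python
import pandas as pd


def qs_pair_code_combinations(quotes_per_pair_code_all, QS_MIN_L1_ROWS_PER_DAY=100, v=True):
    """Hypothetical reconstruction: keep only groups with enough rows."""
    # After groupby(...).count(), each column holds a per-group count;
    # this sketch simply uses the first column as the row count.
    row_counts = quotes_per_pair_code_all.iloc[:, 0]
    valid = quotes_per_pair_code_all.loc[row_counts >= QS_MIN_L1_ROWS_PER_DAY]
    if v:
        print("Got %s valid combinations" % valid.shape[0])
    return valid


# Toy per-group counts, indexed by the numeric pair code (made-up values).
counts = pd.DataFrame(
    {"my_value": [250, 99, 100]},
    index=pd.Index(
        [2019052200210, 2019052200175, 2019052200448], name="pair_code_numeric"
    ),
)

valid = qs_pair_code_combinations(counts, QS_MIN_L1_ROWS_PER_DAY=100, v=False)
print(valid.index.tolist())
```

The index of the returned frame is what the next step turns into the list of valid pair codes.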

Then we can simply create a new column that shows whether each row is valid:

valid_pairs_numeric = qs_pair_code_combis.index.tolist()
ddf_c['is_in_valid_set_of_combis'] = ddf_c[col_name].isin(valid_pairs_numeric)

Finally, we can filter the huge dask.DataFrame:

len(ddf_c.loc[ddf_c.is_in_valid_set_of_combis == True])
# > 246641219 (Correct after filtering)
