python - How to speed up pandas boolean indexing with multiple string conditions

Question

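I have a 73-million-row dataset and need to filter out rows that match any of several conditions. I have been doing this with boolean indexing, but it takes a long time (about 30 minutes), and I would like to know whether it can be made faster (e.g. fancy indexing, np.where, np.compress?).

My code: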

clean_df = df[~(df.project_name.isin(p_to_drop) | 
                df.workspace_name.isin(ws_to_drop) | 
                df.campaign_name.str.contains(regex_string,regex=True) | 
                df.campaign_name.isin(small_launches))]

The regex string is:

# Adjacent string literals are joined, so the indentation does not end up inside the pattern
regex_string = ('(?i)^.*ARCHIVE.*$|^.*birthday.*$|^.*bundle.*$|^.*Competition followups.*$|^.*consent.*$|^.*DOI.*$|'
                '^.*experiment.*$|^.*hello.*$|^.*new subscribers.*$|^.*not purchased.*$|^.*parent.*$|'
                '^.*re engagement.*$|^.*reengagement.*$|^.*re-engagement.*$|^.*resend.*$|^.*Resend of.*$|'
                '^.*reward.*$|^.*survey.*$|^.*test.*$|^.*thank.*$|^.*welcome.*$')

The other three conditions are lists of strings with fewer than 50 items each.
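A side note on the pattern itself, offered as a sketch rather than a measured fix: `str.contains` searches anywhere in the string (like `re.search`), so for single-line campaign names `^.*word.*$` and plain `word` should select the same rows, and the shorter form gives the regex engine less to do. Building the pattern from a list also avoids the pitfall of splitting a long string literal with backslash continuations, where indentation on the continued lines becomes part of the pattern. The list name `keywords` and the sample strings below are made up for illustration, and only a few of the keywords are shown.

```python
import re

# Illustrative subset of the keywords from the question; extend as needed.
keywords = ["ARCHIVE", "birthday", "re engagement", "re-engagement", "welcome"]

# re.escape guards against keywords that contain regex metacharacters.
simple_pattern = "(?i)" + "|".join(re.escape(k) for k in keywords)
anchored_pattern = "(?i)" + "|".join(f"^.*{re.escape(k)}.*$" for k in keywords)

# Made-up campaign names used only to compare the two patterns.
samples = ["Spring Welcome series", "archive 2019", "Weekly newsletter", "Re-Engagement push"]

simple_hits = [bool(re.search(simple_pattern, s)) for s in samples]
anchored_hits = [bool(re.search(anchored_pattern, s)) for s in samples]
```

The resulting `simple_pattern` can be passed to `df.campaign_name.str.contains(simple_pattern, regex=True)` in place of the longer string. Whether this is noticeably faster on 73 million rows would need to be timed on the real data.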

score 2 · Accepted Answer

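With that many rows, I think it will be faster to drop records one step at a time. Regex is usually slow, so you can apply it as the last step, when the DataFrame is much smaller.

For example: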

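clean_df = df.copy()
# Build each mask from clean_df (not df) so every step scans only the rows that are left
clean_df = clean_df.loc[~clean_df.project_name.isin(p_to_drop)]
clean_df = clean_df.loc[~clean_df.workspace_name.isin(ws_to_drop)]
clean_df = clean_df.loc[~clean_df.campaign_name.isin(small_launches)]
# Regex last: it is the most expensive check
clean_df = clean_df.loc[~clean_df.campaign_name.str.contains(regex_string, regex=True)]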
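To check that step-by-step filtering keeps exactly the same rows as the single combined mask from the question, here is a minimal sketch on a toy frame. All values are invented, and the short `regex_string` stands in for the full pattern.

```python
import pandas as pd

# Toy data standing in for the 73-million-row frame; all values are made up.
df = pd.DataFrame({
    "project_name": ["p1", "p2", "p3", "p4", "p5"],
    "workspace_name": ["w1", "w1", "w2", "w3", "w3"],
    "campaign_name": ["Spring sale", "Welcome flow", "tiny launch", "Summer sale", "TEST send"],
})
p_to_drop = ["p1"]
ws_to_drop = ["w2"]
small_launches = ["tiny launch"]
regex_string = "(?i)welcome|test"  # shortened stand-in for the real pattern

# Single combined mask, as in the question: every condition scans every row.
combined = df[~(df.project_name.isin(p_to_drop)
                | df.workspace_name.isin(ws_to_drop)
                | df.campaign_name.str.contains(regex_string, regex=True)
                | df.campaign_name.isin(small_launches))]

# Step by step: each mask is built from the already-reduced frame,
# so the regex only sees rows that survived the cheaper isin checks.
clean_df = df
clean_df = clean_df[~clean_df.project_name.isin(p_to_drop)]
clean_df = clean_df[~clean_df.workspace_name.isin(ws_to_drop)]
clean_df = clean_df[~clean_df.campaign_name.isin(small_launches)]
clean_df = clean_df[~clean_df.campaign_name.str.contains(regex_string, regex=True)]
```

Both approaches leave the same rows; the difference is only in how many rows each later condition has to examine.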

score 1 · Accepted Answer

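I used to think that chaining my conditions was a good idea, but the answer that runs them one after another made me rethink: every time I run a boolean indexing operation I make the dataset smaller, so the next operation is cheaper.

As suggested, I split them up and put the operation that removes the most rows first, so the operations that follow are faster. I put the regex last: it is expensive, so it makes sense to run it on the smallest possible df.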
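One way to decide which filter "removes the most rows" is to measure it on a sample before touching the full frame. The helper `drop_fractions` and the toy data below are hypothetical names and values for illustration, not part of the original code.

```python
import pandas as pd

def drop_fractions(frame, conditions):
    """Return the fraction of rows each condition would remove, largest first."""
    fractions = {name: float(make_mask(frame).mean())
                 for name, make_mask in conditions.items()}
    return dict(sorted(fractions.items(), key=lambda item: item[1], reverse=True))

# Made-up sample; on real data this could be something like df.sample(n=100_000).
toy = pd.DataFrame({
    "project_name": ["p1", "p1", "p1", "p2", "p3", "p4"],
    "workspace_name": ["w1", "w2", "w1", "w1", "w1", "w1"],
})

# Each entry maps a label to a function that builds the boolean mask for a frame.
conditions = {
    "project": lambda f: f.project_name.isin(["p1"]),
    "workspace": lambda f: f.workspace_name.isin(["w2"]),
}

ranking = drop_fractions(toy, conditions)
```

The keys of `ranking` give a suggested order for the cheap `isin` filters. The regex would still go last regardless of its drop fraction, since its per-row cost is the highest.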

Hope this helps someone! TIL that chaining your operations looks nice but isn't efficient :)
