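Question:

Hello,

While looking into ways of selecting rows from a DataFrame (I am relatively inexperienced with Pandas), I ran into the following question:

For large datasets, which is faster: .isin() or .query()?

Query is a bit more intuitive to read, so it is my preferred approach for my line of work. However, when testing it on a very small example dataset, query seems to be much slower.

Has anyone tested this properly before? If so, what were the results? I searched the web but could not find another post on this.

Please see the example code below, which works on Python 3.8.5.

Thank you very much for your help!

Code: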
# Packages
import pandas as pd
import timeit
import numpy as np


# Create dataframe
df = pd.DataFrame({'name': ['Foo', 'Bar', 'Faz'],
               'owner': ['Canyon', 'Endurace', 'Bike']},
                index=['Frame', 'Type', 'Kind'])

# Show dataframe
df

# Create filter
selection = ['Canyon']

# Filter dataframe using 'isin' (type 1)
df_filtered = df[df['owner'].isin(selection)] 

%timeit df_filtered = df[df['owner'].isin(selection)]
213 µs ± 14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


# Filter dataframe using 'isin' (type 2)
df[np.isin(df['owner'].values, selection)]

%timeit df_filtered = df[np.isin(df['owner'].values, selection)]
128 µs ± 3.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


# Filter dataframe using 'query'
df_filtered = df.query("owner in @selection")

%timeit df_filtered = df.query("owner in @selection")
1.15 ms ± 9.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
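
Before comparing speed, it can help to confirm that the three filters actually select the same rows. The following is only a minimal sketch on the same three-row example frame; the variable names by_isin, by_np_isin, by_query and same_rows are illustrative and not part of the original post.

```python
import numpy as np
import pandas as pd

# Same example frame as in the question
df = pd.DataFrame({'name': ['Foo', 'Bar', 'Faz'],
                   'owner': ['Canyon', 'Endurace', 'Bike']},
                  index=['Frame', 'Type', 'Kind'])
selection = ['Canyon']

# The three filtering styles being timed in the question
by_isin = df[df['owner'].isin(selection)]
by_np_isin = df[np.isin(df['owner'].to_numpy(), selection)]
by_query = df.query("owner in @selection")

# DataFrame.equals compares values, index and dtypes
same_rows = by_isin.equals(by_np_isin) and by_isin.equals(by_query)
print(same_rows)
```

If the check fails on real data, the timings are not comparing like with like.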

2 Answers


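Performance plots for some generated data:

[Figure: Bench 1]

This assumes some hypothetical data and a selection whose size grows in proportion to the frame (10% of the frame size).

Sample data for n=10: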

df

       name  owner
0  Constant  JoVMq
1  Constant  jiKNB
2  Constant  WEqhm
3  Constant  pXNqB
4  Constant  SnlbV
5  Constant  Euwsj
6  Constant  QPPbs
7  Constant  Nqofa
8  Constant  qeUKP
9  Constant  ZBFce

Selection

['ZBFce']

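Performance reflects the documentation. On smaller frames the overhead of query is significantly higher than that of isin. However, on frames of around 200k rows the performance is comparable to isin, and on frames of around 10m rows query starts to become more performant.

I agree with @jezrael that this is, as with most pandas runtime questions, very data dependent. The best test is to run it on a real dataset for the given use case and make a decision based on that.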
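
A minimal sketch of how such a test could be set up outside IPython, using the standard timeit module. The helper names (filter_isin, filter_query, best_time) and the tiled example frame are assumptions for illustration only; swap in your real DataFrame and selection list.

```python
import timeit

import pandas as pd


def filter_isin(df, selection):
    return df[df['owner'].isin(selection)]


def filter_query(df, selection):
    # `selection` is a local name in this function, so @selection resolves here
    return df.query("owner in @selection")


def best_time(func, df, selection, repeat=3, number=5):
    """Return the best observed time per call, in seconds."""
    timings = timeit.repeat(lambda: func(df, selection),
                            repeat=repeat, number=number)
    return min(timings) / number


# Placeholder data: replace with the real frame and selection for your use case
base = pd.DataFrame({'name': ['Foo', 'Bar', 'Faz'],
                     'owner': ['Canyon', 'Endurace', 'Bike']})
df = pd.concat([base] * 1000, ignore_index=True)
selection = ['Canyon']

times = {'isin': best_time(filter_isin, df, selection),
         'query': best_time(filter_query, df, selection)}
for label, seconds in times.items():
    print(f"{label}: {seconds * 1e3:.3f} ms per call")
```

Taking the minimum over several repeats is one common way to reduce noise from other processes; the absolute numbers will differ between machines and pandas versions.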


Edit: Included @AlexanderVolkovsky's suggestion to convert selection to a set and use apply + in

[Figure: Bench 2]


Perfplot code:

import string

import numpy as np
import pandas as pd
import perfplot

charset = list(string.ascii_letters)

np.random.seed(5)


def gen_data(n):
    df = pd.DataFrame({'name': 'Constant',
                       'owner': [''.join(np.random.choice(charset, 5))
                                 for _ in range(n)]})
    selection = df['owner'].sample(frac=.1).tolist()
    return df, selection, set(selection)


def test_isin(params):
    df, selection, _ = params
    return df[df['owner'].isin(selection)]


def test_query(params):
    df, selection, _ = params
    return df.query("owner in @selection")


def test_apply_over_set(params):
    df, _, set_selection = params
    return df[df['owner'].apply(lambda x: x in set_selection)]


if __name__ == '__main__':
    out = perfplot.bench(
        setup=gen_data,
        kernels=[
            test_isin,
            test_query,
            test_apply_over_set
        ],
        labels=[
            'test_isin',
            'test_query',
            'test_apply_over_set'
        ],
        n_range=[2 ** k for k in range(25)],
        equality_check=None
    )
    out.save('perfplot_results.png', transparent=False)
answered 2021-06-04T10:33:15.957

The best test is on real data. Here is a quick comparison for 3k, 300k and 3M rows with this sample data:

selection = ['Hedge']

# 3k rows: the original 3-row df repeated 1,000 times
df = pd.concat([df] * 1000, ignore_index=True)
In [139]: %timeit df[df['owner'].isin(selection)]
449 µs ± 58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [140]: %timeit df.query("owner in @selection")
1.57 ms ± 33.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# 300k rows: start again from the original 3-row df, repeated 100,000 times
df = pd.concat([df] * 100000, ignore_index=True)
In [142]: %timeit df[df['owner'].isin(selection)]
8.25 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [143]: %timeit df.query("owner in @selection")
13 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

# 3M rows: start again from the original 3-row df, repeated 1,000,000 times
df = pd.concat([df] * 1000000, ignore_index=True)
In [145]: %timeit df[df['owner'].isin(selection)]
94.5 ms ± 9.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [146]: %timeit df.query("owner in @selection")
112 ms ± 499 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

If you check the docs:

DataFrame.query() using numexpr is slightly faster than Python for large frames
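
To see which engine is involved, query accepts an engine keyword that is passed through to pandas.eval. The sketch below is only an illustration under assumptions: query_with_engine and has_numexpr are made-up names, and the numexpr branch runs only if that optional package is importable. As far as I know, the pandas user guide also notes that in / not in expressions are evaluated in plain Python because numexpr has no equivalent operation, so the gain from numexpr may be limited for this particular filter.

```python
import importlib.util

import pandas as pd

df = pd.DataFrame({'name': ['Foo', 'Bar', 'Faz'],
                   'owner': ['Canyon', 'Endurace', 'Bike']},
                  index=['Frame', 'Type', 'Kind'])
selection = ['Canyon']

# numexpr is an optional dependency; check for it without importing it
has_numexpr = importlib.util.find_spec("numexpr") is not None


def query_with_engine(df, selection, engine):
    # `engine` is forwarded by DataFrame.query to pandas.eval
    return df.query("owner in @selection", engine=engine)


python_result = query_with_engine(df, selection, "python")
print(python_result)

if has_numexpr:
    try:
        numexpr_result = query_with_engine(df, selection, "numexpr")
        print(numexpr_result.equals(python_result))
    except ImportError:
        # pandas may reject a numexpr version it does not support
        print("numexpr is installed but not usable by this pandas version")
```

Timing both engines on a realistically sized frame shows whether the engine choice matters for a given workload.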

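Conclusion: it is best to test on real data, because the result depends on the number of rows, the number of matched values, and the length of the selection list.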

answered 2021-06-04T10:10:29.247