python - Why is swifter slower than vanilla df.apply?

Question

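I have a DataFrame with 1 million rows and a function (which I cannot vectorize) that I apply to each row. I looked into swifter, which promises to speed up the computation by using multiple processes. On an 8-core machine, however, that is not what happens: swifter takes 1 min 19 s, while plain df.apply takes 46 s.

Any idea why?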

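# Assumption: Feature and Point come from the geojson package
# (the original post does not show its imports).
from geojson import Feature, Point

def parse_row(n_print=None):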
    def f(row):
        if n_print is not None and row.name % n_print == 0:
            print(row.name, end="\r")
        return Feature(
            geometry=Point((float(row["longitude"]), float(row["latitude"]))),
            properties={
                "water_level": float(row["water_level"]),
                "return_period": float(row["return_period"])
            }
        )
    return f

In [12]: df["feature"] = df.swifter.apply(parse_row(), axis=1)
Dask Apply: 100%|████████████████████████████████████████| 48/48 [01:19<00:00,  1.65s/it]

In [13]: t = time(); df["feature"] = df.apply(parse_row(), axis=1); print(int(time() - t))
46

score 2 · Accepted Answer

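It mostly depends on the processing power available and on whether vectorization, parallel processing, or other optimization can actually improve this particular problem. Sometimes it simply isn't a solution. Also keep in mind that swifter spends time estimating how long the job will take. df.apply is sometimes faster because it skips that estimate, and the optimization may not help anyway.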
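Since the outcome depends on the workload, one way to check is to time both approaches on a small sample before committing to the full 1 million rows. The following is a minimal sketch, not part of the original answer: `time_candidates` and `fastest` are hypothetical helper names, and the demo uses a stand-in workload so that it runs without pandas or swifter installed.

```python
import time


def time_candidates(candidates, data, repeats=3):
    """Return the best wall-clock time in seconds for each candidate callable.

    `candidates` maps a label to a function that takes `data` as its only
    argument. Each function is run `repeats` times and the fastest run is kept.
    """
    timings = {}
    for name, func in candidates.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            func(data)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return timings


def fastest(timings):
    """Return the label with the smallest measured time."""
    return min(timings, key=timings.get)


if __name__ == "__main__":
    # Stand-in workload (hypothetical); replace with the real DataFrame sample.
    sample = list(range(10_000))
    results = time_candidates(
        {
            "loop": lambda xs: [float(x) for x in xs],
            "map": lambda xs: list(map(float, xs)),
        },
        sample,
    )
    for name, seconds in results.items():
        print(f"{name}: {seconds:.4f}s")
    print("fastest:", fastest(results))
```

With the code from the question, the candidates could be `lambda d: d.apply(parse_row(), axis=1)` and `lambda d: d.swifter.apply(parse_row(), axis=1)`, with something like `df.head(10_000)` as the data. A small sample may not scale linearly to the full frame, so treat the result as a hint rather than a guarantee.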
