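Should intermediate operations (such as Dask's dd.merge) be called inside or outside the LocalCluster context? And does the final computation (such as .compute) need to happen inside or outside it?
My main question is: how does LocalCluster() affect the number of tasks?
My colleagues and I noticed that calling dd.merge outside of LocalCluster() reduces the number of tasks significantly (by roughly a factor of 10). What is the reason for this?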
Pseudo-example
Many tasks:
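import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# Pseudocode: `somewhere` and `smth` are placeholders.
dd.read_parquet(somewhere, index=False)
with LocalCluster(
    n_workers=8,
    processes=True,
    threads_per_worker=1,
    memory_limit="10GB",
    ip="tcp://localhost:9895",
) as cluster, Client(cluster) as client:
    dd.merge(smth)  # merge is called inside the cluster context
    smth.to_parquet(
        somewhere, engine="fastparquet", compression="snappy"
    )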
Few tasks:
# Pseudocode: `somewhere` and `smth` are placeholders.
dd.read_parquet(somewhere, index=False)
dd.merge(smth)  # merge is called before the cluster is created
with LocalCluster(
    n_workers=8,
    processes=True,
    threads_per_worker=1,
    memory_limit="10GB",
    ip="tcp://localhost:9895",
) as cluster, Client(cluster) as client:
    smth.to_parquet(
        somewhere, engine="fastparquet", compression="snappy"
    )