I have a problem: using dask-ml's StandardScaler on a very large (chunked) array leaks memory so badly that the fit never completes. Here is my code and some information about the data.

from dask_ml.preprocessing import StandardScaler
import dask
from dask.distributed import Client, LocalCluster

import rioxarray

client = Client(memory_limit='200GB', n_workers=20, threads_per_worker=2, processes=False)
da = rioxarray.open_rasterio(r'H:/DEV/GMS/data/raster_stack01.dat')

da_rechunk = da.chunk({"band": 1, 'x': 5000, 'y': 5000})

The result of the code above looks like this: [repr of da_rechunk]

Next I try to use StandardScaler:

scaler = StandardScaler()
scaler.fit(da_rechunk)

At this point I get messages like these:

distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 130.39 GiB -- Worker memory limit: 186.26 GiB
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 130.40 GiB -- Worker memory limit: 186.26 GiB
distributed.utils_perf - WARNING - full garbage collections took 98% CPU time recently (threshold: 10%)
distributed.worker - WARNING - gc.collect() took 3.578s. This is usually a sign that some tasks handle too many Python objects at the same time. Rechunking the work into smaller tasks might help.

On the client dashboard I can see it using over 4 TB of bytes plus over 60 GB spilled to disk. After reading all the chunked xarrays into the workers, processing just hangs. Rechunking to (1, 1000, 1000) did not help either.
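
For reference, that smaller rechunking looked roughly like this (the dimension names are assumed to match the earlier chunk call):

# smaller chunks I also tried; this did not change the behaviour
da_rechunk = da.chunk({"band": 1, "x": 1000, "y": 1000})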

Is StandardScaler in dask-ml implemented for this kind of use case? Is this a bug in dask, or am I doing something wrong?
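
In case it is relevant: my understanding is that dask-ml scalers generally expect a 2-D (n_samples, n_features) array rather than a 3-D xarray DataArray, so I am wondering whether I would have to reshape the (band, y, x) cube myself before fitting. A rough sketch of what I mean (variable names are just illustrative, I have not verified this works at scale):

from dask_ml.preprocessing import StandardScaler

arr = da_rechunk.data                # underlying dask array of da_rechunk above, shape (band, y, x)
X = arr.reshape(arr.shape[0], -1).T  # flatten pixels: shape (y * x, band)

scaler = StandardScaler()
scaler.fit(X)                        # per-band mean/std over all pixels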
