I have a problem: using dask-ml's StandardScaler on a very large (chunked) array leads to what looks like a massive memory leak, and the fit never finishes. Here is my code and some information about the data.
from dask_ml.preprocessing import StandardScaler
import dask
from dask.distributed import Client, LocalCluster
import rioxarray

# in-process (threaded) local cluster: 20 workers, 2 threads each, 200 GB memory limit per worker
client = Client(memory_limit='200GB', n_workers=20, threads_per_worker=2, processes=False)

# raster stack as an xarray DataArray (band, y, x), rechunked to one band per chunk and 5000 x 5000 spatial tiles
da = rioxarray.open_rasterio(r'H:/DEV/GMS/data/raster_stack01.dat')
da_rechunk = da.chunk({"band": 1, 'x': 5000, 'y': 5000})
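For reference, the rechunked array can be inspected like this (I'm leaving out the actual shape and chunk numbers since they depend on the raster):

# dimensions, dtype and chunk layout of the rechunked DataArray
print(da_rechunk.dims, da_rechunk.shape, da_rechunk.dtype)
print(da_rechunk.chunks)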
Next I try to use StandardScaler:
scaler = StandardScaler()
scaler.fit(da_rechunk)
At this point I get messages like this:
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 130.39 GiB -- Worker memory limit: 186.26 GiB
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 130.40 GiB -- Worker memory limit: 186.26 GiB
distributed.utils_perf - WARNING - full garbage collections took 98% CPU time recently (threshold: 10%)
distributed.worker - WARNING - gc.collect() took 3.578s. This is usually a sign that some tasks handle too many Python objects at the same time. Rechunking the work into smaller tasks might help.
On the client dashboard I see it using over 4 TB of bytes plus over 60 GB spilled to disk. After reading all of the chunked xarray data into the workers, processing just hangs. Rechunking to (1, 1000, 1000) did not help (see the snippet below).
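The (1, 1000, 1000) attempt is just the same call with smaller tiles:

# smaller spatial tiles: still one band per chunk, but 1000 x 1000 pixels
da_rechunk = da.chunk({"band": 1, 'x': 1000, 'y': 1000})
scaler = StandardScaler()
scaler.fit(da_rechunk)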
Is StandardScaler in dask-ml implemented for this kind of use case? Is this a bug in dask, or am I doing something wrong?
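For comparison, this is the kind of 2-D (samples x features) input I understand dask-ml's StandardScaler to be designed for; a minimal sketch with made-up sizes, not my real data:

import dask.array as darr
from dask_ml.preprocessing import StandardScaler

# hypothetical 2-D input: rows are samples (pixels), columns are features (bands)
X = darr.random.random((10_000_000, 8), chunks=(1_000_000, 8))

scaler = StandardScaler()
scaler.fit(X)                    # estimates per-column mean_ and var_ from the chunks
X_scaled = scaler.transform(X)   # returns a scaled dask array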