I am in a situation where I want to load objects, transform them into an xarray.Dataset, and write the result into a Zarr store on S3. To make the loading of objects faster, I do it in parallel across distinct years using a Dask cluster, and thus would also like to do the writing to S3 in parallel. The general structure is:
```python
def my_task(year, store, synchronizer):
    # load objects for that year
    ...
    dataset = ...
    # I want to append this data to a zarr store on S3
    dataset.to_zarr(store, append_dim="time", synchronizer=synchronizer)

futures = []
for y in years:
    futures.append(client.submit(my_task, y, same_store, synchronizer))
client.gather(futures)
```
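The store passed to the workers points at S3; a minimal sketch of such a setup, assuming s3fs (the bucket and prefix below are placeholders, not my actual paths), looks like:

```python
import s3fs

# Placeholder store setup: bucket and prefix are illustrative only.
fs = s3fs.S3FileSystem()
same_store = fs.get_mapper("s3://my-bucket/my-dataset.zarr")
```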
However, running these tasks in parallel fails: the store ends up in an inconsistent state because multiple workers write to it at the same time. I have tried using zarr.sync.ProcessSynchronizer, but the issue persisted, similar to this case: How can one write lock a zarr store during append?
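For reference, a minimal sketch of how I wired in the synchronizer (the .sync path is a placeholder):

```python
import zarr

# Placeholder synchronizer setup: the .sync path is illustrative only.
# ProcessSynchronizer coordinates writers via file locks, so every worker
# would need to see this path on a shared filesystem.
synchronizer = zarr.sync.ProcessSynchronizer("dataset.sync")
```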
Any help or guidance would be appreciated. Thanks!