I want to load objects, transform them into an xarray.Dataset, and write the result to a Zarr store on S3. To make loading faster, I process distinct years in parallel on a Dask cluster, and I would therefore also like to write to S3 in parallel. The general structure is:

from dask.distributed import Client

def my_task(year, store, synchronizer):
    # load objects for that year
    ...
    dataset = ...  # build an xarray.Dataset from the loaded objects
    # append this year's data to the shared Zarr store on S3
    dataset.to_zarr(store, append_dim="time", synchronizer=synchronizer)


client = Client(...)  # connected to the Dask cluster
futures = []
for y in years:
    futures.append(client.submit(my_task, y, same_store, synchronizer))
client.gather(futures)

However, doing so fails: the store ends up in an inconsistent state because multiple workers are writing at the same time. I have tried using zarr.sync.ProcessSynchronizer, but the issue persisted, similar to this question: How can one write lock a zarr store during append? Any help or guidance would be appreciated. Thanks!
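
For completeness, the synchronizer attempt looked roughly like this (a minimal sketch; the bucket path, lock directory, and s3fs mapping are placeholders I am assuming, not my real values):

import s3fs
import zarr

# hypothetical store location
s3 = s3fs.S3FileSystem()
store = s3fs.S3Map(root="my-bucket/my-dataset.zarr", s3=s3, check=False)

# ProcessSynchronizer coordinates via file locks in a local directory,
# so it only protects processes that share that filesystem; Dask workers
# on separate machines each see their own lock directory, which may be
# why it did not help here.
synchronizer = zarr.ProcessSynchronizer("/tmp/zarr_append.sync")

futures = [client.submit(my_task, y, store, synchronizer) for y in years]
client.gather(futures)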
