python - Dask xarray created from an 'isel' operation fails to load values (too slow)

Question

I have a 4D dask array whose dimensions correspond to (time, depth, latitude, longitude). FYI, this is an ocean dataset.

# Create xarray dataset object for ocean temperature (time x depth x lat x lon)

DS=xr.open_mfdataset([outdir + s for s in flist],combine='by_coords',chunks={'xi_rho':25,'eta_rho':25})

DS.temp

Output:

xarray.DataArray
'temp' (ocean_time: 1456, s_rho: 50, eta_rho: 489, xi_rho: 655)
dask.array<chunksize=(1456, 50, 25, 25), meta=np.ndarray>

When I try to load a 1D vector from this dask array, there is no problem.

T=DS.temp
%time
T.isel(ocean_time=0,eta_rho=100,xi_rho=500).values

Output (values omitted):

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 5.96 µs

I now want to select only those (lat, lon) locations where the ocean bottom is deeper than 1000 m.

depth_max = 1e3
deep = np.where(DS.h >= depth_max)  # DS.h is the depth of the ocean bottom.

# Number of locations where the ocean is deeper than depth_max
xy_num=len(deep[0])

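This gives me a tuple 'deep'. Its first element (deep[0]) is the list of all 'eta_rho' (latitude index) values that satisfy the condition, and its second element (deep[1]) is the list of the corresponding 'xi_rho' (longitude index) values. So I can construct (lat, lon) index pairs as (deep[0][0], deep[1][0]), (deep[0][1], deep[1][1]), (deep[0][2], deep[1][2]), (deep[0][3], deep[1][3]), and so on.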
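As a minimal sketch of the pairing described above, the snippet below runs np.where on a small made-up depth grid. The array `h` and its values are hypothetical stand-ins for DS.h, not data from the question.

```python
import numpy as np

# Hypothetical 3 x 4 bottom-depth grid in metres, standing in for DS.h
h = np.array([
    [200.0, 1500.0, 800.0, 2500.0],
    [1200.0, 300.0, 950.0, 1000.0],
    [50.0, 400.0, 3000.0, 999.0],
])
depth_max = 1e3

# For a 2D input, np.where(condition) returns one index array per axis
deep = np.where(h >= depth_max)
rows, cols = deep

# Pair the two index arrays element by element to get (row, col) locations
pairs = list(zip(rows.tolist(), cols.tolist()))
xy_num = len(rows)

print(pairs)
print(xy_num)
```

The tuple has one entry per array axis, so the pairing only means (lat, lon) when the input is strictly 2D.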

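This is great, because I want to create a single coordinate that runs over the (lat, lon) pairs satisfying the above condition. The goal is to go from (time, depth, lat, lon) -> (time, depth, gthmax), where gthmax is the new coordinate described above. I do it like this: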

# Picking only those (lat,lon) where the condition is satisfied
T_deep=T.isel(eta_rho=xr.DataArray(deep[0],dims='gthmax'),xi_rho=xr.DataArray(deep[1],dims='gthmax'))

# Explicitly assign the new coordinate
T_deep=T_deep.assign_coords({"gthmax":range(0,xy_num)})

# Create chunks along this new coordinate
T_deep=T_deep.chunk({'gthmax':1000})

T_deep

Output (showing only the dimensions):

xarray.DataArray 'temp' (ocean_time: 1456, s_rho: 50, gthmax: 133446)
dask.array<chunksize=(1456, 50, 1000), meta=np.ndarray>
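For reference, here is a small self-contained sketch of the pointwise (vectorized) indexing used above, on an in-memory array rather than a dask-backed one. The array `T`, its shape, and the index values are hypothetical; only the dimension names follow the question.

```python
import numpy as np
import xarray as xr

# Hypothetical small array with the same dimension names as DS.temp
T = xr.DataArray(
    np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5),
    dims=("ocean_time", "s_rho", "eta_rho", "xi_rho"),
)

# Both indexers share the new dimension "gthmax", so they are applied pairwise
eta_idx = xr.DataArray([0, 2, 3], dims="gthmax")
xi_idx = xr.DataArray([1, 4, 0], dims="gthmax")

# Element k along gthmax is T[..., eta_idx[k], xi_idx[k]]
T_pts = T.isel(eta_rho=eta_idx, xi_rho=xi_idx)

# Label the new dimension with a simple 0..N-1 coordinate
T_pts = T_pts.assign_coords(gthmax=np.arange(T_pts.sizes["gthmax"]))

print(T_pts.dims)
print(T_pts.shape)
```

Because the two indexers share one dimension, the result has one entry per (eta, xi) pair rather than the outer product of the two index lists.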

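This is where the problem appears. When I try to access values from this new 3D dask array, even at a single point, the command below never finishes and I end up having to kill the kernel. I have also tried load() and compute(), to no avail.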

T_deep.isel(ocean_time=0,s_rho=46,gthmax=100).values

Any ideas about what I got wrong in going from the original 4D dask array to the 3D dask array?

Thanks!

score 0 · Accepted Answer

I don't have the dataset to test with, so it is hard to say with limited information. Here is my first guess.

When you load the data with xr.open_mfdataset, a new dimension 'ocean_time' is actually assigned to the bathymetry variable h. The dimensions of 'h' change from [eta_rho, xi_rho] to [ocean_time, eta_rho, xi_rho]. So when you run

depth_max=1e3  
deep = np.where( DS.h >= depth_max)

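the dimensions of deep are not [eta_rho, xi_rho]; they are actually [ocean_time, eta_rho, xi_rho] as well. In that case deep[0] would hold ocean_time indices and deep[1] would hold eta_rho indices. I think this is where the problem arises. You are mapping the coordinate system from [ocean_time, s_rho, eta_rho, xi_rho] to [ocean_time, s_rho, gthmax] with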

eta_rho=xr.DataArray(deep[0],dims='gthmax')  
xi_rho=xr.DataArray(deep[1],dims='gthmax')

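Note that the 'ocean_time' dimension has length 1456, far larger than the horizontal dimensions (eta_rho: 489, xi_rho: 655). I believe this confuses xarray/dask and slows things down when you request the values.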
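One way to check this guess is to look at the dimensions of h before calling np.where. The sketch below uses a hypothetical, tiny bathymetry array that has been given a time dimension. It is an illustration of the check under that assumption, not a confirmed diagnosis of the dataset in the question.

```python
import numpy as np
import xarray as xr

# Hypothetical 2 x 2 bathymetry repeated along a time axis of length 3
h2d = np.array([[500.0, 1500.0], [2000.0, 100.0]])
h = xr.DataArray(
    np.stack([h2d, h2d, h2d]),
    dims=("ocean_time", "eta_rho", "xi_rho"),
)

print(h.dims)

# With a 3D input, np.where returns three index arrays; deep_3d[0] is time
deep_3d = np.where((h >= 1e3).values)
print(len(deep_3d))

# Drop the time dimension first if it is present, then index in 2D
h_static = h.isel(ocean_time=0) if "ocean_time" in h.dims else h
eta_idx, xi_idx = np.where((h_static >= 1e3).values)
print(eta_idx, xi_idx)
```

If h does carry the time dimension, passing data_vars="minimal" to xr.open_mfdataset is one option that may keep variables without 'ocean_time' from being expanded along it.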

Now that I have your dataset, let me update my answer.

df = xr.open_mfdataset('./*.nc',  
                       combine='by_coords',  
                       chunks={'ocean_time':10}, 
                       parallel=True, 
)                                                                                                                 

temp = df[['temp']]                                                                                            
temp1 = temp.sel(xi_rho=xi_rho,eta_rho=eta_rho)                                                                   
%time temp1.isel(ocean_time=0,s_rho=46,z=100).compute()                                                          
CPU times: user 2.2 s, sys: 1.11 s, total: 3.31 s
Wall time: 3.05 s
