python - Most efficient way to select variables and compute their mean with Xarray

Question

I have a 3 GB datacube opened with xarray that contains 3 variables I am interested in (v, vx, vy). The dataset description and my code are below.

I am only interested in one specific time window, 2009 to 2013, while the whole dataset spans 1984 to 2018.

What I want to do is:

Get the v, vx, and vy values between 2009 and 2013
Compute their mean along the time axis and save them as three 334x333 arrays

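The problem is that it takes so long that, after 1 hour, the few lines of code I wrote were still running. What I don't understand is that if I save the "v" values as an array, load them as such, and compute their mean, it takes far less time than what I wrote below (see code). I don't know whether there is a memory leak or whether this is simply a bad way to do it. My PC has 16 GB of RAM, of which 60% is available before loading the datacube. So in theory it should have enough memory to compute everything.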

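What would be an efficient way to truncate my datacube to the desired time window and then compute the time mean (over axis 0) of the 3 variables "v", "vx", "vy"?

I tried doing it like this: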

import numpy as np
import xarray as xr

datacube = xr.open_dataset('datacube.nc')  # Load the datacube
datacube = datacube.reindex(mid_date=sorted(datacube.mid_date.values))  # Sort the datacube by ascending time, where "mid_date" is the time dimension

sdate = '2009-01'   # Start date
edate = '2013-12'   # End date

ds = datacube.sel(mid_date=slice(sdate, edate))  # New datacube with only the values between the start and end dates

vvtot = np.nanmean(ds.v.values, axis=0)  # Mean of the "v" variable along the time axis, ignoring NaNs
vxtot = np.nanmean(ds.vx.values, axis=0)
vytot = np.nanmean(ds.vy.values, axis=0)

Dataset description:

Dimensions:                    (mid_date: 18206, y: 334, x: 333)
Coordinates:
  * mid_date                   (mid_date) datetime64[ns] 1984-06-10T00:00:00....
  * x                          (x) float64 4.868e+05 4.871e+05 ... 5.665e+05
  * y                          (y) float64 6.696e+06 6.696e+06 ... 6.616e+06
Data variables: (12/43)
    UTM_Projection             object ...
    acquisition_img1           (mid_date) datetime64[ns] ...
    acquisition_img2           (mid_date) datetime64[ns] ...
    autoRIFT_software_version  (mid_date) float64 ...
    chip_size_height           (mid_date, y, x) float32 ...
    chip_size_width            (mid_date, y, x) float32 ...
                        ...
    vy                         (mid_date, y, x) float32 ...
    vy_error                   (mid_date) float32 ...
    vy_stable_shift            (mid_date) float64 ...
    vyp                        (mid_date, y, x) float64 ...
    vyp_error                  (mid_date) float64 ...
    vyp_stable_shift           (mid_date) float64 ...
Attributes:
    GDAL_AREA_OR_POINT:         Area
    datacube_software_version:  1.0
    date_created:               30-01-2021 20:49:16
    date_updated:               30-01-2021 20:49:16
    projection:                 32607

score 1 · Accepted Answer

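Try to avoid calling ".values" in intermediate steps: as soon as you do, you switch from an xr.DataArray to an np.array, and the whole variable is loaded into memory!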
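To get a feel for why materialising a whole variable is expensive here, the following back-of-the-envelope sketch estimates the in-memory size of one float32 variable such as "v" over the full time axis. The dimensions and dtype are taken from the dataset description in the question; that the 3 GB file is smaller because it is compressed on disk is an assumption, not something the question states.

```python
# Dimensions and dtype taken from the dataset description in the question:
# (mid_date: 18206, y: 334, x: 333), with "v", "vx", "vy" stored as float32.
n_time, n_y, n_x = 18206, 334, 333
bytes_per_value = 4  # float32

n_values = n_time * n_y * n_x
size_bytes = n_values * bytes_per_value
size_gib = size_bytes / 1024**3

print(f"{n_values:,} values per variable")
print(f"{size_gib:.2f} GiB per float32 variable over the full time axis")
```

This comes to roughly 7.5 GiB per variable, so any step that pulls several full variables into RAM at once can plausibly exhaust 16 GB. Keeping the computation in xarray, as in the code that follows, avoids materialising whole variables up front.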

import xarray as xr
from dask.diagnostics import ProgressBar

sdate = '2009-01'   # Start date
edate = '2013-12'   # End date

# Open the dataset lazily, split into dask chunks.
ds = xr.open_dataset(r"/path/to/your/data/test.nc", chunks="auto")

# Label-based slicing expects a sorted time coordinate
# ("mid_date" is the time dimension in the question's dataset).
ds = ds.sortby("mid_date")

# Select the period you want to have the mean for.
ds = ds.sel(mid_date=slice(sdate, edate))

# Calculate the mean for all the variables in your ds.
ds = ds.mean(dim="mid_date")

# The above code takes less than a second, because no actual
# calculations have been done yet (and no data has been loaded into your RAM).
# Once you use ".values", ".compute()", or
# ".to_netcdf()" they will be done. We can see progress like this:
with ProgressBar():
    ds = ds.compute()
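The same steps can be tried on a small synthetic dataset without the original file. This is a minimal sketch, not the asker's data: the grid size, the weekly dates, the random values, and the names `cube`, `window`, and `means` are all made up for illustration. It also narrows the dataset to the three variables of interest before averaging, and cross-checks one result against the `np.nanmean` approach from the question.

```python
import numpy as np
import pandas as pd
import xarray as xr

rng = np.random.default_rng(0)

# Hypothetical stand-in for the datacube: weekly dates in shuffled order.
dates = pd.date_range("2008-01-01", "2014-12-31", freq="7D")
shuffled = dates[rng.permutation(len(dates))]
shape = (len(dates), 4, 5)  # tiny (mid_date, y, x) grid for illustration


def fake_field():
    """Random float32 field with about 10% of values set to NaN."""
    data = rng.normal(size=shape).astype("float32")
    data[rng.random(shape) < 0.1] = np.nan
    return (("mid_date", "y", "x"), data)


cube = xr.Dataset(
    {
        "v": fake_field(),
        "vx": fake_field(),
        "vy": fake_field(),
        "vy_error": ("mid_date", rng.normal(size=len(dates))),
    },
    coords={"mid_date": shuffled},
)

# Sort by time, keep only the variables of interest, then slice the window.
window = (
    cube.sortby("mid_date")[["v", "vx", "vy"]]
    .sel(mid_date=slice("2009-01", "2013-12"))
)

# Dataset.mean skips NaNs by default for float data.
means = window.mean(dim="mid_date")

# Cross-check one variable against the NumPy approach from the question.
expected = np.nanmean(window["v"].values, axis=0)
print(np.allclose(means["v"].values, expected, atol=1e-5))
print(means["v"].shape)
```

Selecting `[["v", "vx", "vy"]]` before the reduction means the remaining variables in the file are never read or averaged, which matters when the dataset holds dozens of them.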
