Currently, my code heavily uses structured masked arrays with multidimensional dtypes, with dozens of fields and item sizes of many kilobytes. It appears that xarray could be a great alternative, but when I try to pass it a masked array, it changes its dtype to float:

In [137]: x = arange(30, dtype="i1").reshape(3, 10)

In [138]: xr.Dataset({"count": (["x", "y"], ma.masked_where(x%5>3, x))}, coords={"x": range(3), "y":
     ...: range(10)})
Out[138]:
<xarray.Dataset>
Dimensions:  (x: 3, y: 10)
Coordinates:
  * y        (y) int64 0 1 2 3 4 5 6 7 8 9
  * x        (x) int64 0 1 2
Data variables:
    count    (x, y) float64 0.0 1.0 2.0 3.0 nan 5.0 6.0 7.0 8.0 nan 10.0 ...

This is undesirable for me, because (1) the memory consumption of my dataset will explode (it is already large), and (2) many of my integer dtypes are bit fields, which must not be represented as floats. Although an int32 bit field can be represented losslessly as a float64, converting back and forth is ugly and error-prone.

Is it possible to use xarray.Dataset with masked arrays while preserving integer dtypes?


Edit: The problem appears to occur in _maybe_promote. See also the GitHub issue.

1 Answer

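Unfortunately, xarray does not support masked arrays or any form of integer dtype with missing values. The reasons for this choice are the same as the reasons pandas does not (currently) support integer NAs, as described in the pandas documentation under Caveats and Gotchas. We would need an integer dtype with support for missing values in NumPy arrays, which sadly does not exist.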

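I agree that this is not a very satisfying solution for images with missing values, though in many cases I have found it sufficient to work with unmasked integer data and convert to float (masking the missing values) only when arithmetic requires it (e.g., by making use of .fillna()).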
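
One way to apply this is sketched below, reusing the array from the question. It assumes the mask is stored as a separate boolean variable (the name `valid` is made up for this example) and uses `.where()` for the on-demand conversion; that is just one possible choice, not necessarily the exact workflow meant above.

```python
import numpy as np
import xarray as xr

# Same data as in the question: int8 values, some of them flagged as missing.
x = np.arange(30, dtype="i1").reshape(3, 10)
missing = x % 5 > 3

# Store the raw integers and a validity mask as two separate variables.
# Neither contains NaN, so there is nothing that forces a dtype promotion.
# "valid" is a hypothetical variable name chosen for this sketch.
ds = xr.Dataset(
    {
        "count": (["x", "y"], x),
        "valid": (["x", "y"], ~missing),
    },
    coords={"x": range(3), "y": range(10)},
)
print(ds["count"].dtype)

# Bit-field operations stay in integer space.
low_bits = ds["count"] & 0b11

# Convert to float only at the point where missing-value semantics are needed.
masked = ds["count"].where(ds["valid"])
print(masked.dtype)
print(float(masked.mean()))
```

The float copy exists only for the variable and the expression that need it, while the stored data keeps its integer dtype.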

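As for memory usage, I recommend trying xarray together with dask, which allows most array operations to be performed in a streaming or distributed fashion.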
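
A minimal sketch of what that could look like, assuming dask is installed and again assuming a separate, hypothetically named `valid` mask variable. The chunk size here is arbitrary and would need tuning for a real dataset.

```python
import numpy as np
import xarray as xr

x = np.arange(30, dtype="i1").reshape(3, 10)
ds = xr.Dataset(
    {
        "count": (["x", "y"], x),
        "valid": (["x", "y"], x % 5 <= 3),  # hypothetical mask variable
    }
)

# Split the data into one chunk per row. This call requires dask.
chunked = ds.chunk({"x": 1})

# Building the expression is lazy: no float array is materialized yet.
lazy_mean = chunked["count"].where(chunked["valid"]).mean()

# .compute() evaluates the task graph chunk by chunk.
result = float(lazy_mean.compute())
print(result)
```

With this layout the float conversion happens per chunk during the computation, so the full float array does not have to fit in memory at once.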

Answered 2017-01-07T02:50:05.737