numpy-memmap - Can operations on a numpy.memmap be deferred?

Question

Consider this example:

import numpy as np
a = np.array(1)
np.save("a.npy", a)

a = np.load("a.npy", mmap_mode='r')
print(type(a))

b = a + 2
print(type(b))

which outputs

<class 'numpy.core.memmap.memmap'>
<class 'numpy.int32'>

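So it seems that b is no longer a memmap, and I assume this forces numpy to read the whole of a.npy, defeating the purpose of the memmap. Hence my question: can operations on memmaps be deferred until access time?

I believe subclassing ndarray or memmap could work, but I don't feel confident enough in my Python skills to try it.

Here is an extended example showing my problem: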

import numpy as np

# create 8 GB file
# np.save("memmap.npy", np.empty([1000000000]))

# I want to print the first value using f and memmaps


def f(value):
    print(value[0])


# this is fast: f receives a memmap
a = np.load("memmap.npy", mmap_mode='r')
print("a = ")
f(a)

# this is slow: b has to be read completely; converted into an array
b = np.load("memmap.npy", mmap_mode='r')
print("b + 1 = ")
f(b + 1)

score 1 · Accepted Answer

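This is just how Python works. By default, numpy operations return a new array, so b never exists as a memmap; it is created as an ordinary in-memory result when + is called on a.

There are a couple of ways to work around this. The simplest is to do all operations in place,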

a += 1

which requires loading the memory-mapped array for both reading and writing,

a = np.load("a.npy", mmap_mode='r+')
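As a minimal sketch of the in-place approach, the following assumes a small scratch file in a temporary directory (the path and array contents are made up for illustration). It shows that an in-place operation on an 'r+' memmap keeps the memmap type and changes the data on disk.

```python
import os
import tempfile

import numpy as np

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "a.npy")
np.save(path, np.arange(5, dtype=np.int64))

# 'r+' maps the file read-write, so in-place operations modify the file itself.
a = np.load(path, mmap_mode='r+')
a += 1      # in-place ufunc call: the result is written into a's own buffer
a.flush()   # ask for pending changes to be written to disk
still_memmap = isinstance(a, np.memmap)

# Reloading the file shows that the on-disk data was changed.
reloaded = np.load(path)
print(still_memmap, reloaded)
```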

Of course, this is no good if you don't want to overwrite the original array.
In that case, you need to specify that b should be memmapped.

b = np.memmap("b.npy", mode='w+', dtype=a.dtype, shape=a.shape)

The assignment can then be done using the out keyword provided by numpy ufuncs.

np.add(a, 2, out=b)
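Putting the pieces together, here is a hedged end-to-end sketch. Note that np.memmap writes raw bytes without a .npy header, so a file created that way cannot be read back with np.load. This sketch swaps in np.lib.format.open_memmap, which does write the header. The scratch directory and file names are assumptions made for the example.

```python
import os
import tempfile

import numpy as np

tmpdir = tempfile.mkdtemp()  # hypothetical scratch directory
src_path = os.path.join(tmpdir, "a.npy")
dst_path = os.path.join(tmpdir, "b.npy")
np.save(src_path, np.arange(10, dtype=np.float64))

a = np.load(src_path, mmap_mode='r')

# open_memmap creates a memmap backed by a valid .npy file (header included).
b = np.lib.format.open_memmap(dst_path, mode='w+', dtype=a.dtype, shape=a.shape)

# The ufunc writes its result straight into b instead of allocating a new array.
np.add(a, 2, out=b)
b.flush()

# The output can be mapped again later, just like the input.
result = np.load(dst_path, mmap_mode='r')
print(type(result).__name__, result[:3])
```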

score 1 · Accepted Answer

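Here is a simple example of an ndarray subclass that defers operations applied to it until a specific element is requested by indexing.
I'm including it to show that it can be done, but it will almost certainly fail in novel and unexpected ways, and it would take a lot of work to make it usable. For a very specific case, it may be easier than redesigning your code to solve the problem in a better way. I recommend reading over these examples from the docs to help understand how it works.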

import numpy as np
class Defered(np.ndarray):
      """
      An array class that defers calculations applied to it, only
      calculating them when an index is requested
      """
      def __new__(cls, arr):
            arr = np.asanyarray(arr).view(cls)
            arr.toApply = []
            return arr

      def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
            ## Convert all arguments to ndarray, otherwise arguments
            # of type Defered will cause infinite recursion
            # also store self as None, to be replaced later on
            newinputs = []
            for i in inputs:
                  if i is self:
                        newinputs.append(None)
                  elif isinstance(i, np.ndarray):
                        newinputs.append(i.view(np.ndarray))
                  else:
                        newinputs.append(i)

            ## Store function to apply and necessary arguments
            self.toApply.append((ufunc, method, newinputs, kwargs))
            return self

      def __getitem__(self, idx):
            ## Get index and convert to regular array
            sub = self.view(np.ndarray).__getitem__(idx)

            ## Apply stored actions
            for ufunc, method, inputs, kwargs in self.toApply:
                  inputs = [i if i is not None else sub for i in inputs]
                  sub = super().__array_ufunc__(ufunc, method, *inputs, **kwargs)

            return sub

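This will fail if the array is modified by anything that does not go through numpy's universal functions. For example, percentile and median are not based on ufuncs and would end up loading the entire array. Likewise, if you pass it to a function that iterates over the array, or index a large number of elements, the entire array will be loaded.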
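If subclassing ndarray feels too fragile, one possible alternative is a plain wrapper object that queues functions and applies them only to the indexed part. The sketch below is not part of numpy: LazyMap and its map method are hypothetical names, and it only gives correct results for element-wise functions, where indexing first and computing afterwards is equivalent to computing first and indexing afterwards.

```python
import os
import tempfile

import numpy as np


class LazyMap:
    """Hypothetical wrapper: queue element-wise functions, apply them on indexing."""

    def __init__(self, base, funcs=()):
        self._base = base
        self._funcs = tuple(funcs)

    def map(self, func):
        # Return a new wrapper; the underlying array is not touched here.
        return LazyMap(self._base, self._funcs + (func,))

    def __getitem__(self, idx):
        # Read only the requested part of the file, then apply queued functions.
        sub = np.asarray(self._base[idx])
        for func in self._funcs:
            sub = func(sub)
        return sub


path = os.path.join(tempfile.mkdtemp(), "data.npy")  # hypothetical scratch file
np.save(path, np.arange(1000, dtype=np.float64))

a = np.load(path, mmap_mode='r')
b = LazyMap(a).map(lambda x: x + 1).map(np.sqrt)  # nothing is computed yet

first = b[0]      # computes sqrt(a[0] + 1) only
window = b[3:6]   # computes sqrt(a[3:6] + 1) only
print(first, window)
```

Because the wrapper is not an ndarray, passing it to functions such as np.median will not work directly; it has to be indexed first.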
