numpy - Understanding NumPy internals for profiling

Question

Profiling a piece of numpy code shows that I'm spending most of the time within these two functions

numpy/matrixlib/defmatrix.py.__getitem__:301
numpy/matrixlib/defmatrix.py.__array_finalize__:279

Here is the NumPy source:

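Question 1:

__getitem__ seems to be called every time I use something like my_array[arg], and it gets more expensive if arg is not an integer but a slice. Is there any way to speed up calls to array slices?

E.g. in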

for i in range(idx): res[i] = my_array[i:i+10].mean()

Question 2:

When exactly does __array_finalize__ get called, and how can I speed things up by reducing the number of calls to this function?

Thanks!

score 11 · Accepted Answer

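You could avoid matrices as much as possible and just use 2-D numpy arrays. I typically only use matrices for a short time to take advantage of the syntax for multiplication (but with the addition of the .dot method on arrays, I find I do that less and less as well).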
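As a rough illustration of that suggestion, the sketch below converts a matrix to a plain 2-D array once and then works on the array. The variable names and the 3x4 example data are made up for illustration; only `np.matrix`, `np.asarray` and `.dot` come from NumPy itself.

```python
import numpy as np

# Illustrative data: a small matrix (the subclass defined in defmatrix.py)
m = np.matrix(np.arange(12.0).reshape(3, 4))

# Convert once; np.asarray gives a plain ndarray that shares the matrix's memory
a = np.asarray(m)

# Indexing a matrix goes through matrix.__getitem__ and stays a 2-D matrix;
# indexing the plain array uses ndarray's own C-level indexing
row_m = m[0]
row_a = a[0]

# Matrix multiplication: operator syntax on the matrix, .dot on the array
product_m = m * m.T
product_a = a.dot(a.T)

print(type(row_m).__name__, row_m.shape)
print(type(row_a).__name__, row_a.shape)
```

Doing the conversion once, outside any hot loop, means every later index or slice skips the Python-level methods of the matrix class.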

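But, to your questions:

1) There really is no shortcut around __getitem__ unless defmatrix overrides __getslice__, which it could do but doesn't yet. There are the .item and .itemset methods, which are optimized for integer getting and setting (and return Python objects rather than NumPy's array scalars).

2) __array_finalize__ is called whenever an array object (or subclass) is created. It is called from the C function that every array creation gets funneled through. https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c#L1003

For subclasses defined purely in Python, it calls back into the Python interpreter from C, which has overhead. If the matrix class were a built-in type (a Cython-based cdef class, for example), the call could avoid the Python interpreter overhead.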
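Both points can be observed with a small experiment. This is only a sketch: `CountingArray` is a hypothetical subclass written for this illustration (it is not part of NumPy), and only `.item` is shown because `.itemset` is not available in every NumPy release.

```python
import numpy as np

class CountingArray(np.ndarray):
    """Hypothetical ndarray subclass that counts __array_finalize__ calls."""
    calls = 0

    def __array_finalize__(self, obj):
        # Runs in the Python interpreter each time a CountingArray is created
        CountingArray.calls += 1

data = np.arange(20.0)

# Point 1: .item returns a Python float, plain indexing returns a NumPy scalar
as_python = data.item(3)
as_numpy = data[3]

# Point 2: every slice of the subclass creates a new subclass instance,
# so the Python-level hook fires for each one
sub = data.view(CountingArray)
calls_after_view = CountingArray.calls
for i in range(5):
    window = sub[i:i + 10]
calls_after_sub_slices = CountingArray.calls

# Slicing a plain ndarray view of the same data never reaches the hook
plain = np.asarray(sub)
calls_before_plain = CountingArray.calls
for i in range(5):
    window = plain[i:i + 10]
calls_after_plain_slices = CountingArray.calls

print(type(as_python).__name__, type(as_numpy).__name__)
print("hook calls from subclass slices:", calls_after_sub_slices - calls_after_view)
print("hook calls from plain slices:", calls_after_plain_slices - calls_before_plain)
```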

score 3 · Accepted Answer

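Question 1:

Since array slices can sometimes require a copy of the underlying data structure (which holds the pointers to the data in memory), they can be quite expensive. If you're truly bottlenecked by this in your example above, you can perform the mean operation by actually iterating over elements i to i+10 and computing the mean manually. For some operations this won't give any performance improvement, but avoiding the creation of new data structures will generally speed up the process.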
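A sketch of the alternatives, assuming a 1-D float array and only full windows of 10 elements. The function names are invented for this example, and the cumulative-sum variant is a different technique from the manual loop described above, included only for comparison. Whether any of them is faster depends on the array type and sizes, so measure before switching.

```python
import numpy as np

def means_by_slicing(arr, n_windows, width=10):
    # The loop from the question: one new array object per slice
    res = np.empty(n_windows)
    for i in range(n_windows):
        res[i] = arr[i:i + width].mean()
    return res

def means_by_hand(arr, n_windows, width=10):
    # Manual accumulation: reads single elements, creates no slice objects
    res = np.empty(n_windows)
    for i in range(n_windows):
        total = 0.0
        for j in range(i, i + width):
            total += arr.item(j)
        res[i] = total / width
    return res

def means_by_cumsum(arr, n_windows, width=10):
    # Swapped-in technique: c[k] holds the sum of the first k elements,
    # so the sum of window i is c[i + width] - c[i]
    c = np.concatenate(([0.0], np.cumsum(arr)))
    return (c[width:width + n_windows] - c[:n_windows]) / width

my_array = np.arange(50, dtype=float)
n_windows = my_array.size - 10 + 1  # full windows only

sliced = means_by_slicing(my_array, n_windows)
manual = means_by_hand(my_array, n_windows)
vectorized = means_by_cumsum(my_array, n_windows)
print(np.allclose(sliced, manual), np.allclose(sliced, vectorized))
```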

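Another note: if you're not using native types inside numpy, you will take a very large performance penalty when manipulating a numpy array. Say your array has dtype=float64 and your native machine float size is float32: this will cost numpy a lot of extra computation and overall performance will drop. Sometimes this is fine and you can just take the hit for the sake of maintaining a data type. Other times it's arbitrary which type the float or int is stored as internally. In those cases, try dtype=float instead of dtype=float64. Numpy should default to your native type. I've had 3x+ speedups on numpy-intensive algorithms by making this change.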
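Before relying on this, it may be worth checking what dtype=float actually resolves to on your installation. In NumPy, the Python `float` type maps to float64, so the sketch below would report the two as identical and the cast as a no-op; the variable names are illustrative only.

```python
import numpy as np

# What does the Python-level alias resolve to?
alias = np.dtype(float)
explicit = np.dtype(np.float64)
same = alias == explicit

# With copy=False, astype hands back the very same array when no
# conversion is needed, which shows whether the change does anything
a = np.zeros(8, dtype=np.float64)
b = a.astype(float, copy=False)
no_conversion = b is a

print(alias, explicit, same, no_conversion)
```

If the two dtypes compare equal, any speedup observed after such a change has to come from somewhere else, for example from removing an accidental mix of float32 and float64 arrays.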

问题2：

__array_finalize__ "is called whenever the system internally allocates a new array from obj, where obj is a subclass (subtype) of the (big)ndarray" according to SciPy. Thus this is a result described in the first question. When you slice and make a new array, you have to finalize that array by either making structural copies or wrapping the original structure. This operation takes time. Avoiding slices will save on this operation, though for multidimensional data it may be impossible to completely avoid calls to __array_finalize__.
