python - 使用 einsum 进一步优化 numpy 用于堆叠矩阵向量乘法

Question

我有一个使用线性矩阵变换的“粒子传播”相对简单的例子。

我的粒子分布基本上是一组（“束”）5维向量。它通常包含 100k 到 1M 个这样的向量。

这些向量中的每一个都必须乘以一个矩阵。

到目前为止我想出的解决方案如下。

粒子是这样创建的，协方差矩阵在这里显示为对角线，但这是为了一个相对简单的示例：

# Edit: I now use np.random_intel linking to MKL for improved performances
d = np.random.multivariate_normal(
    [0.0,
     0.0,
     0.0,
     0.0,
     0.0
     ],
    np.array([
        [1.0, 0.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.1]
    ]),
    int(1e5)
)

传播矩阵很简单

D = np.array([[1, 10, 0, 0], 
          [0, 1, 0, 0],
          [0, 0, 1, 0],
          [0, 0, 0, 1]])

我的解决方案einsum是

r = np.einsum('ij,kj->ik', d[:, 0:4], D)

（请注意，这里我滑动以仅获取向量的前四个坐标，但原因无关）。

有没有办法显着加快速度？

我对所有细节没有清晰的看法，但这里有一些想法：

einsum默认情况下不调用 BLAS 但使用内部 SSE 优化，有没有办法用纯 BLAS 调用来表达我的问题，这会使其更快？
显然最近版本einsum的optimize选项可以打开以在更广泛的情况下回退到 BLAS 调用。我试过了，它不会改变执行时间。
PyPy 和 numpy 会更好吗？

我测试了@Divakar 的建议，它确实更快（10M 粒子）：

%%timeit
r = d[:, 0:4].dot(D.T)
# 541 ms ± 9.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

与我最初的相比

%%timeit -n 1 -r 1
r = np.einsum('ij,kj->ik', d[:, 0:4], D, optimize=True)
# 1.74 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

一个可能影响最终答案的直接相关问题：

我该如何处理“丢失”的粒子？

在逐个粒子矩阵相乘之后，我将检查一些坐标的上限，例如 (r是上一步的结果：

selected = (r[:, 0] < 0.1) & (r[:, 1] < 0.1)
ind = np.where(selected)
r[ind]

然后应用的下一轮矩阵乘法r[ind]。

有几件事我不清楚：

这是最有效的吗？
它不会创建太多副本吗？
在跟踪它们丢失的事实（通过掩码）的同时“保留”未选择的粒子（并无论如何将它们相乘）不是更好吗？那是更多的乘法，但是可以将所有内容都保存在一个对象中，而无需进一步分配并保持所有内容对齐？

score 1 · Accepted Answer

为了进一步提高@Divakar 建议的代码的性能，我宁愿建议使用PyTorch库。与使用NumPy 数组的普通点积（np.dot()ms

首先，我将演示如何在 NumPy 和PyTorch中做到这一点。（由于PyTorch与NumPy ndarray共享相同的内存，我们不需要做额外的工作）

计时

# setup inputs
In [61]: d = np.random.multivariate_normal(
    ...:     [0.0,
    ...:      0.0,
    ...:      0.0,
    ...:      0.0,
    ...:      0.0
    ...:      ],
    ...:     np.array([
    ...:         [1.0, 0.0, 0.0, 0.0, 0.0],
    ...:         [0.0, 1.0, 0.0, 0.0, 0.0],
    ...:         [0.0, 0.0, 1.0, 0.0, 0.0],
    ...:         [0.0, 0.0, 0.0, 1.0, 0.0],
    ...:         [0.0, 0.0, 0.0, 0.0, 0.1]
    ...:     ]),
    ...:     int(1e5)
    ...: )

In [62]: d.dtype
Out[62]: dtype('float64')

In [63]: D = np.array([[1, 10, 0, 0], 
    ...:           [0, 1, 0, 0],
    ...:           [0, 0, 1, 0],
    ...:           [0, 0, 0, 1]], dtype=np.float64)
    ...:           

In [64]: DT = D.T

In [65]: DT.dtype
Out[65]: dtype('float64')


# create input tensors in PyTorch
In [66]: d_tensor = torch.DoubleTensor(d[:, 0:4])

In [67]: DT_tensor = torch.DoubleTensor(DT)

# float64 tensors
In [69]: type(d_tensor), type(DT_tensor)
Out[69]: (torch.DoubleTensor, torch.DoubleTensor)

# dot/matmul using `np.dot()`
In [73]: np_dot = np.dot(d[:, 0:4], DT)

# matmul using `torch.matmul()`
In [74]: torch_matmul = torch.matmul(d_tensor, DT_tensor)

# sanity check!! :)
In [75]: np.allclose(np_dot, torch_matmul)
Out[75]: True

现在是不同方法的时机！

In [5]: %timeit r = np.einsum('ij,kj->ik', d[:, 0:4], D)
2.63 ms ± 97.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %timeit r = d[:, 0:4].dot(D.T)
1.56 ms ± 47.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [7]: %timeit r = np.einsum('ij,kj->ik', d[:, 0:4], D, optimize=True)
2.73 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# over 2 orders of magnitude faster :)
In [14]: %timeit torch_matmul = torch.matmul(d_tensor, DT_tensor)
87 µs ± 7.71 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

需要注意的重要一点是，我们需要在NumPy ndarray和PyTorch张量中具有相同的数据类型。（这里我使用了since返回的值。所以，我将矩阵向上转换为。相应地，在创建PyTorch张量时，我使用了相当于的。这种数据类型匹配对于获得相同的结果是必不可少的，尤其是在处理浮点数时数）。np.float64np.random.multivariate_normalfloat64Dfloat64torch.DoubleTensornp.float64

因此，关键的一点是PyTorch 张量运算比NumPy ndarray运算快几个数量级。

python - 使用 einsum 进一步优化 numpy 用于堆叠矩阵向量乘法

1 回答 1

Related

Reference