I have read a bunch of discussions on the topic of np.einsum.
To learn more about why np.einsum is faster than the usual np.sum, np.product, etc. (even for the latest numpy version in anaconda), I am using np.einsum_path to see what the optimization process actually optimizes.
While doing so, I stumbled on an interesting phenomenon. Consider this minimal example:
import numpy as np
for i in 'int8 int16 int32 int64 uint8 uint16 uint32 uint64 float32 float64'.split():
    print(np.einsum_path('i->', np.empty(2**30, i))[1])
The outputs are all identical:
Complete contraction: i->
Naive scaling: 1
Optimized scaling: 1
Naive FLOP count: 1.074e+09
Optimized FLOP count: 2.147e+09
Theoretical speedup: 0.500
Largest intermediate: 1.000e+00 elements
--------------------------------------------------------------------------
scaling current remaining
--------------------------------------------------------------------------
1 i-> ->
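The same report can be reproduced on a much smaller array by parsing the string that np.einsum_path returns; the regex below is my own helper for pulling out the two FLOP figures, not part of numpy, and depending on your numpy version the ratio may or may not show the same anomaly:

```python
import re
import numpy as np

# Same 'i->' contraction as above, on a smaller array so it is cheap to build.
n = 2 ** 20
_, info = np.einsum_path('i->', np.empty(n, np.float64))

# Pull the "Naive FLOP count" and "Optimized FLOP count" figures out of the
# report string.
naive, optimized = (float(m) for m in re.findall(r'FLOP count:\s+([0-9.e+]+)', info))
print(naive, optimized, naive / optimized)
```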
The optimized FLOP count has increased (which should mean more computation?), and the theoretical speedup is below 1 (which should mean it is slower). But if we actually time the computation:
for i in 'int8 int16 int32 int64 uint8 uint16 uint32 uint64 float32 float64'.split():
    a = np.empty(2**27, i)
    raw = %timeit -qon9 a.sum()
    noOp = %timeit -qon9 np.einsum('i->', a, optimize=False)
    op = %timeit -qon9 np.einsum('i->', a, optimize='greedy')
    print(i, raw.average/op.average, noOp.average/op.average, sep='\t')
Looking at the second column, which is the unoptimized timing divided by the optimized timing, all values are close to 1, meaning the optimization does not actually make it slower:
int8 4.485133392283354 1.0205873691331475
int16 3.7817373109729213 0.9528030137222752
int32 1.3760725925789292 1.0741615462167338
int64 1.0793509548186524 1.0076602576129605
uint8 4.509893894635594 0.997277624256872
uint16 3.964949791428885 0.9914991211913878
uint32 1.3054813163356085 1.009475242303559
uint64 1.0747670688044795 1.0082522386805526
float32 2.4105510701565636 0.9998241152368149
float64 2.1957241421227556 0.9836838487664662
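For reference, since the %timeit magic above only works inside IPython, the same comparison can be run with the stdlib timeit module; the array size is reduced to 2**20 and the dtype list shortened purely to keep the run quick:

```python
import timeit
import numpy as np

for dtype in 'int8 int64 float32 float64'.split():
    a = np.empty(2 ** 20, dtype)
    # Average of 9 runs each, mirroring the -n9 option above.
    raw = timeit.timeit(lambda: a.sum(), number=9) / 9
    noop = timeit.timeit(lambda: np.einsum('i->', a, optimize=False), number=9) / 9
    op = timeit.timeit(lambda: np.einsum('i->', a, optimize='greedy'), number=9) / 9
    print(dtype, raw / op, noop / op, sep='\t')
```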
So I wonder why np.einsum_path says this contraction needs more FLOPs and will be slower. I believe the theoretical speedup is computed directly from the FLOP counts, so the two figures are essentially referring to the same thing.
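For context, the cost model behind these reports appears to live in numpy's private `numpy.core.einsumfunc._flop_count` helper; the sketch below is my re-implementation of that heuristic (an assumption about internals, which may differ across numpy versions), not a public API:

```python
from math import prod

def flop_count(idx_contraction, inner, num_terms, size_dict):
    """Estimated FLOPs for one contraction step (mirrors numpy's internal heuristic)."""
    # Size of the full index space touched by this step.
    overall_size = prod(size_dict[idx] for idx in idx_contraction)
    op_factor = max(1, num_terms - 1)  # one multiply per extra operand
    if inner:                          # plus one add per element when an index is summed
        op_factor += 1
    return overall_size * op_factor

# 'i->' as a single optimized step: one term, index i is summed (inner=True).
print(flop_count({'i'}, True, 1, {'i': 2 ** 30}))  # 2147483648 ~ 2.147e+09
```

Under this model, the single-step 'i->' contraction gets op_factor = 2 (the inner flag adds one), which matches the "Optimized FLOP count" of 2.147e9, while the naive figure of 1.074e9 is evidently computed without that extra factor; which of the two better reflects the real work is exactly what the timings above put in question.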
By the way, I am attaching an example showing how np.einsum_path "usually" behaves, to demonstrate that the result above is an anomaly:
a = np.empty((64, 64))
print(np.einsum_path('ij,jk,kl->il', a, a, a)[1])
noOp = %timeit -qon99 np.einsum('ij,jk,kl->il', a, a, a, optimize=False)
op = %timeit -qon99 np.einsum('ij,jk,kl->il', a, a, a, optimize='greedy')
print('Actual speedup:', noOp.average / op.average)
Output:
Complete contraction: ij,jk,kl->il
Naive scaling: 4
Optimized scaling: 3
Naive FLOP count: 5.033e+07
Optimized FLOP count: 1.049e+06
Theoretical speedup: 48.000
Largest intermediate: 4.096e+03 elements
--------------------------------------------------------------------------
scaling current remaining
--------------------------------------------------------------------------
3 jk,ij->ik kl,ik->il
3 ik,kl->il il->il
Actual speedup: 90.33518444642904
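As a sanity check, the figures in this second report can be reproduced by hand with the same per-step cost model (size of the contracted index space times the number of ops per element): the naive count covers all four indices in one step, while the optimized path does two pairwise matmuls:

```python
n = 64
# Naive: all three operands combined in a single step over indices i,j,k,l
# (2 multiplies + 1 add per element of the 64**4 index space).
naive = 3 * n ** 4            # 50331648 ~ 5.033e+07
# Optimized: two pairwise matmuls, each 2 ops per element of a 64**3 space.
optimized = 2 * (2 * n ** 3)  # 1048576 ~ 1.049e+06
print(naive / optimized)      # 48.0 -> the reported theoretical speedup
```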