简短的回答:您的代码的更快版本是这样的:
tic;
for z = 1:1e3
for cr = 1:1000
B(:,cr) = B(:,cr) - k*row(cr);
end
end
toc;
你可能想看看我之前对这个问题的回答。简而言之,您的循环对行进行操作,而 MATLAB 是基于列的。这意味着列在内存中是连续的。您的原始循环遍历行,这是低效的。
我的电脑上的执行时间:
% A - k*row
Elapsed time is 4.370238 seconds.
% B(cr,:) = B(cr,:) - k(cr)*row;
Elapsed time is 9.537338 seconds.
% C = C - bsxfun(@times, k, row);
Elapsed time is 3.039836 seconds.
B(:,cr) = B(:,cr) - k*row(cr);
Elapsed time is 2.028186 seconds.
解释。您的第一个版本不是矩阵乘法,而是两个向量的外积,导致大小为 1000 x 1000 的矩阵。孔计算是 BLAS2 秩 1 更新(A=alpha x y'+A 是 GER 函数) . 最有可能的问题是 MATLAB 无法识别它,而是按照它理解的方式运行代码,即显式执行包括 k*row 在内的所有操作。这正是该解决方案的问题所在。外部产品必须分配大小等于矩阵大小的额外内存,这本身需要时间。考虑一下:
- 内存分配 - 由于我们不知道 MATLAB 如何管理内存分配,这可能意味着很多开销。
- 读取向量 k,行
- 写入结果矩阵 (KR)
- 读取 KR 以从 A 中减去它
- A的读写
两个 1000*1000 矩阵是 16 MB - 我怀疑你有这么多缓存。这就是为什么这个版本不是最好的原因,并且实际上可能比“内存效率低”循环慢,具体取决于可用的内存带宽和 CPU 缓存大小。
无需分配 KR 矩阵并将值显式存储在内存中 - 您可以在循环中计算所需的产品。因此,您有效地将内存带宽需求减半并消除了内存分配开销!
假设一个向量适合缓存(它确实 - 1000*8 字节并不多),您只从内存中读取 k 和 row 一次。现在,对列进行循环的算法非常有意义(这可能是 BLAS 实现此计算的方式)
- 读取 k 和 row 并将它们保存在缓存中
- stream A with full memory bandwidth to the CPU, subtract k*row product, stream back to memory
Now the final efficiency considerations. Take my loop structure. In every iteration we read and write A, and read the vectors. That is 16MB of data moved per iteration. 1000 iterations gives 16 GB of data moved in total. Two seconds required to compute the result gives 8GB/s effective memory bandwidth. My system has 16GB/s stream bandwidth when using 2 CPU cores, around 11-12 GB/s when using one. Hence, this sequential loop runs at 60-70% efficiency on one core. Not bad, considering this is a MATLAB loop :)
Note also that, at least on my computer, column-based loop version is (a bit more than) twice faster than the A-krow implementation (2s vs 4.37s). This strongly indicates that krow is indeed explicitly executed and constructed by MATLAB, and the total memory traffic is two times larger than in the looped version. Hence twice worse performance.
Edit You can still try to do it like in the first algorithm by calling directly the corresponding BLAS function. Have a look at this FEX contribution. It allows you to call BLAS and LAPACK functions directly from MATLAB. Might be useful..