56

Background

My question is motivated by simple observations, which somewhat undermine the beliefs/assumptions often held/made by experienced MATLAB users:

  • MATLAB is very well optimized when it comes to the built-in functions and the fundamental language features, such as indexing vectors and matrices.
  • Loops in MATLAB are slow (despite the JIT) and should generally be avoided if the algorithm can be expressed in a native, 'vectorized' manner.

The bottom line: core MATLAB functionality is efficient and trying to outperform it using MATLAB code is hard, if not impossible.

Investigating performance of vector indexing

The example codes shown below are as fundamental as it gets: I assign a scalar value to all vector entries. First, I allocate an empty vector x:

tic; x = zeros(1e8,1); toc
Elapsed time is 0.260525 seconds.

Having x I would like to set all its entries to the same value. In practice you would do it differently, e.g., x = value*ones(1e8,1), but the point here is to investigate the performance of vector indexing. The simplest way is to write:

tic; x(:) = 1; toc
Elapsed time is 0.094316 seconds.

Let's call it method 1 (from the value assigned to x). It seems to be very fast (faster at least than memory allocation). Because the only thing I do here is operate on memory, I can estimate the efficiency of this code by calculating the obtained effective memory bandwidth and comparing it to the hardware memory bandwidth of my computer:

eff_bandwidth = numel(x) * 8 bytes per double * 2 / time

In the above, I multiply by 2 because unless SSE streaming is used, setting values in memory requires that the vector is both read from and written to the memory. In the above example:

eff_bandwidth(1) = 1e8*8*2/0.094316 = 17 Gb/s

STREAM-benchmarked memory bandwidth of my computer is around 17.9 Gb/s, so indeed - MATLAB delivers close to peak performance in this case! So far, so good.

Method 1 is suitable if you want to set all vector elements to some value. But if you want to access elements every step entries, you need to substitute the : with e.g., 1:step:end. Below is a direct speed comparison with method 1:

tic; x(1:end) = 2; toc
Elapsed time is 0.496476 seconds.

While you would not expect it to perform any different, method 2 is clearly big trouble: factor 5 slowdown for no reason. My suspicion is that in this case MATLAB explicitly allocates the index vector (1:end). This is somewhat confirmed by using explicit vector size instead of end:

tic; x(1:1e8) = 3; toc
Elapsed time is 0.482083 seconds.

Methods 2 and 3 perform equally bad.

Another possibility is to explicitly create an index vector id and use it to index x. This gives you the most flexible indexing capabilities. In our case:

tic;
id = 1:1e8; % colon(1,1e8);
x(id) = 4;
toc
Elapsed time is 1.208419 seconds.

Now that is really something - 12 times slowdown compared to method 1! I understand it should perform worse than method 1 because of the additional memory used for id, but why is it so much worse than methods 2 and 3?

Let's try to give the loops a try - as hopeless as it may sound.

tic;
for i=1:numel(x)
    x(i) = 5;
end
toc
Elapsed time is 0.788944 seconds.

A big surprise - a loop beats a vectorized method 4, but is still slower than methods 1, 2 and 3. It turns out that in this particular case you can do it better:

tic;
for i=1:1e8
    x(i) = 6;
end
toc
Elapsed time is 0.321246 seconds.

And that is the probably the most bizarre outcome of this study - a MATLAB-written loop significantly outperforms native vector indexing. That should certainly not be so. Note that the JIT'ed loop is still 3 times slower than the theoretical peak almost obtained by method 1. So there is still plenty of room for improvement. It is just surprising (a stronger word would be more suitable) that usual 'vectorized' indexing (1:end) is even slower.

Questions

  • is simple indexing in MATLAB very inefficient (methods 2, 3, and 4 are slower than method 1), or did I miss something?
  • why is method 4 (so much) slower than methods 2 and 3?
  • why does using 1e8 instead of numel(x) as a loop bound speed up the code by factor 2?

Edit After reading Jonas's comment, here is another way to do that using logical indices:

tic;
id = logical(ones(1, 1e8));
x(id) = 7;
toc
Elapsed time is 0.613363 seconds.

Much better than method 4.

For convenience:

function test

tic; x = zeros(1,1e8); toc

tic; x(:) = 1; toc
tic; x(1:end) = 2; toc
tic; x(1:1e8) = 3; toc

tic;
id = 1:1e8; % colon(1,1e8);
x(id) = 4;
toc

tic;
for i=1:numel(x)
    x(i) = 5;
end
toc

tic;
for i=1:1e8
    x(i) = 6;
end
toc

end
4

3 回答 3

14

当然,我只能推测。但是,当我在启用和禁用 JIT 编译器的情况下运行测试时,我得到以下结果:

 % with JIT   no JIT
    0.1677    0.0011 %# init
    0.0974    0.0936 %# #1 I added an assigment before this line to avoid issues with deferring
    0.4005    0.4028 %# #2
    0.4047    0.4005 %# #3
    1.1160    1.1180 %# #4
    0.8221   48.3239 %# #5 This is where "don't use loops in Matlab" comes from 
    0.3232   48.2197 %# #6
    0.5464   %# logical indexing

除法向我们展示了速度增加的地方:

% withoutJit./withJit
    0.0067 %# w/o JIT, the memory allocation is deferred
    0.9614 %# no JIT
    1.0057 %# no JIT
    0.9897 %# no JIT
    1.0018 %# no JIT
   58.7792 %# numel
  149.2010 %# no numel

初始化的明显加速发生了,因为在关闭 JIT 的情况下,MATLAB 似乎会延迟内存分配直到它被使用,所以 x=zeros(...) 并没有真正做任何事情。(谢谢,@angainor)。

方法 1 到 4 似乎没有从 JIT 中受益。我猜想#4 可能会因为额外的输入测试而变慢,subsref以确保输入的格式正确。

结果numel可能与编译器更难处理不确定的迭代次数有关,或者由于检查循环的边界是否正常而产生一些开销(认为无 JIT 测试建议只有 ~0.1s )

令人惊讶的是,在我机器上的 R2012b 上,逻辑索引似乎比 #4 慢。

我认为这再次表明,MathWorks 在加速代码方面做得很好,如果你想获得最快的执行时间,“不使用循环”并不总是最好的(至少眼下)。尽管如此,我发现矢量化通常是一种很好的方法,因为 (a) JIT 在更复杂的循环中失败,并且 (b) 学习矢量化可以让您更好地理解 Matlab。

结论:如果您想要速度,请使用分析器,如果您切换 Matlab 版本,请重新分析。 正如@Adriaan 在评论中指出的那样,现在使用 timeit() 来衡量执行速度可能会更好。


作为参考,我使用了以下稍微修改的测试功能

function tt = speedTest

tt = zeros(8,1);

tic; x = zeros(1,1e8); tt(1)=toc;

x(:) = 2;

tic; x(:) = 1; tt(2)=toc;
tic; x(1:end) = 2; tt(3)=toc;
tic; x(1:1e8) = 3; tt(4)=toc;

tic;
id = 1:1e8; % colon(1,1e8);
x(id) = 4;
tt(5)=toc;

tic;
for i=1:numel(x)
    x(i) = 5;
end
tt(6)=toc;

tic;
for i=1:1e8
    x(i) = 6;
end
tt(7)=toc;

%# logical indexing
tic;
id = true(1e8,1));
x(id)=7;
tt(8)=toc;
于 2012-11-14T16:12:14.017 回答
9

我没有对所有问题的答案,但我对方法2、3和4有一些精炼的推测。

关于方法 2 和 3。看起来 MATLAB 确实为向量索引分配内存并用从1to的值填充它1e8。为了理解它,让我们看看发生了什么。默认情况下,MATLAB 使用double它的数据类型。分配索引数组的时间与分配时间相同x

tic; x = zeros(1e8,1); toc
Elapsed time is 0.260525 seconds.

目前,索引数组仅包含零。x与方法 1 一样,以最佳方式为向量赋值需要0.094316几秒钟。现在,必须从内存中读取索引向量,以便在索引中使用它。那是额外的0.094316/2秒数。回想一下,x(:)=1向量x必须同时从内存中读取和写入。所以只阅读它需要一半的时间。假设这就是所有在 中完成的x(1:end)=value,方法 2 和 3 的总时间应该是

t = 0.260525+0.094316+0.094316/2 = 0.402

这几乎是正确的,但并不完全正确。我只能推测,但是用值填充索引向量可能是一个额外的步骤,并且需要额外的 0.094316 秒。因此 ,t=0.4963或多或少符合方法 2 和 3 的时间。

这些只是推测,但它们似乎确实证实了 MATLAB在进行本机向量索引时会显式创建索引向量。就个人而言,我认为这是一个性能错误。MATLAB 的 JIT 编译器应该足够聪明,能够理解这个简单的结构并将其转换为对正确内部函数的调用。就像现在一样,在当今的内存带宽受限架构中,索引的理论峰值约为 20%。

因此,如果您确实关心性能,则必须将其实现x(1:step:end)为 MEX 函数,例如

set_value(x, 1, step, 1e8, value);

现在这在 MATLAB 中显然是非法的,因为您不能就地修改 MEX 文件中的数组。

编辑关于方法4,可以尝试分析各个步骤的性能,如下所示:

tic;
id = 1:1e8; % colon(1,1e8);
toc
tic
x(id) = 4;
toc

Elapsed time is 0.475243 seconds.
Elapsed time is 0.763450 seconds.

第一步,分配和填充索引向量的值与单独的方法 2 和 3 所用的时间相同。似乎太多了 - 最多应该花费分配内存和设置值(0.260525s+0.094316s = 0.3548s)所需的时间,所以在某处有几秒钟的额外开销0.12,我无法理解。第二部分 ( ) 看起来也非常低效:设置 的值、读取向量 ( ) 以及对值进行一些错误检查x(id) = 4需要花费时间。用C编程,两个步骤采取:xid0.094316s+0.094316/2s = 0.1415sid

create id                              0.214259
x(id) = 4                              0.219768

使用的代码检查double索引是否实际上代表一个整数,并且它是否适合 的大小x

tic();
id  = malloc(sizeof(double)*n);
for(i=0; i<n; i++) id[i] = i;
toc("create id");

tic();
for(i=0; i<n; i++) {
  long iid = (long)id[i];
  if(iid>=0 && iid<n && (double)iid==id[i]){
    x[iid] = 4;
  } else break;
}
toc("x(id) = 4");

第二步花费的时间比预期的要长0.1415s——这是由于需要对id值进行错误检查。开销对我来说似乎太大了——也许它可以写得更好。不过,所需的时间是0.4340s,不是1.208419s。MATLAB 在幕后做了什么——我不知道。也许有必要这样做,我只是没有看到。

当然,使用doublesas 索引会引入两个额外的开销级别:

  • 大小的double两倍uint32。回想一下,内存带宽是这里的限制因素。
  • 需要将双精度转换为整数以进行索引

方法 4 可以使用整数索引在 MATLAB 中编写:

tic;
id = uint32(1):1e8;
toc
tic
x(id) = 8;
toc

Elapsed time is 0.327704 seconds.
Elapsed time is 0.561121 seconds.

这显然将性能提高了 30%,并证明应该使用整数作为向量索引。但是,开销仍然存在。

正如我现在所看到的,我们无法做任何事情来改善在 MATLAB 框架内工作的情况,我们必须等到 Mathworks 修复这些问题。

于 2012-11-14T17:42:03.783 回答
5

简单说明一下,在 8 年的发展中,MATLAB 的性能特征发生了很大变化。

这是在 R2017a 上(OP 发布 5 年后):

Elapsed time is 0.000079 seconds.    % x = zeros(1,1e8);
Elapsed time is 0.101134 seconds.    % x(:) = 1;
Elapsed time is 0.578200 seconds.    % x(1:end) = 2;
Elapsed time is 0.569791 seconds.    % x(1:1e8) = 3;
Elapsed time is 1.602526 seconds.    % id = 1:1e8; x(id) = 4;
Elapsed time is 0.373966 seconds.    % for i=1:numel(x), x(i) = 5; end
Elapsed time is 0.374775 seconds.    % for i=1:1e8, x(i) = 6; end

请注意 for 循环1:numel(x)比 indexing 更快x(1:end),似乎1:end仍在创建数组,而 for 循环则不是。现在在 MATLAB 中最好不要向量化!

(我确实x(:)=0在分​​配矩阵之后添加了一个赋值,在任何定时区域之外,实际分配了内存,因为zeros只保留内存。)


在 MATLAB R2020b(在线)(3 年后)上,我看到了这些时间:

Elapsed time is 0.000073 seconds.    % x = zeros(1,1e8);
Elapsed time is 0.084847 seconds.    % x(:) = 1;
Elapsed time is 0.084643 seconds.    % x(1:end) = 2;
Elapsed time is 0.085319 seconds.    % x(1:1e8) = 3;
Elapsed time is 1.393964 seconds.    % id = 1:1e8; x(id) = 4;
Elapsed time is 0.168394 seconds.    % for i=1:numel(x), x(i) = 5; end
Elapsed time is 0.169830 seconds.    % for i=1:1e8, x(i) = 6; end

x(1:end)现在在与 相同的情况下进行了优化,不再显式创建x(:)向量。1:end

于 2020-11-01T07:10:01.803 回答