performance - matlab matrix operation speed

Question

I've been asked to make some MATLAB code run faster, and have run into something that seems strange to me.

In one of the functions there's a loop where we multiply a 3-element vector (let's call it x), a 3x3 matrix (let's call it A), and the transpose of x, yielding a scalar. The code has the whole set of element-by-element multiplications and additions, and is pretty cumbersome:

val = x(1)*A(1,1)*x(1) + x(1)*A(1,2)*x(2) + x(1)*A(1,3)*x(3) + ...
      x(2)*A(2,1)*x(1) + x(2)*A(2,2)*x(2) + x(2)*A(2,3)*x(3) + ... 
      x(3)*A(3,1)*x(1) + x(3)*A(3,2)*x(2) + x(3)*A(3,3)*x(3);

I figured I'd just replace it all by:

val = x*A*x';

To my surprise, it ran significantly slower (as in 4-5 times slower). Is it just that the vector and matrix are so small that MATLAB's optimizations don't apply?

score 8 · Accepted Answer

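EDIT: I improved the tests to give more accurate timings. I also optimized the unrolled version, which is now much better than what I initially had; even so, matrix multiplication is far faster as the size increases.

EDIT2: To make sure the JIT compiler is working on the unrolled functions, I modified the code to write the generated functions to M-files. The comparison can also now be considered fair, since both methods are evaluated by passing a function handle to TIMEIT: timeit(@myfunc)

I am not convinced that your approach is faster than matrix multiplication for reasonably sized matrices. So let's compare the two methods.

I am using the Symbolic Math Toolbox to help me get the "unrolled" form of the equation x'*A*x (try multiplying a 20x20 matrix and a 20x1 vector by hand!):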

function f = buildUnrolledFunction(N)
    % avoid regenerating files, CCODE below can be really slow!
    fname = sprintf('f%d',N);
    if exist([fname '.m'], 'file')
        f = str2func(fname);
        return
    end

    % construct symbolic vector/matrix of the specified size
    x = sym('x', [N 1]);
    A = sym('A', [N N]);

    % work out the expanded form of the matrix-multiplication
    % and convert it to a string
    s = ccode(expand(x.'*A*x));    % instead of char(.) to avoid x^2

    % a bit of RegExp to fix the notation of the variable names
    % also convert indexing into linear indices: A(3,3) into A(9)
    s = regexprep(regexprep(s, '^.*=\s+', ''), ';$', '');
    s = regexprep(regexprep(s, 'x(\d+)', 'x($1)'), 'A(\d+)_(\d+)', ...
        'A(${ int2str(sub2ind([N N],str2num($1),str2num($2))) })');

    % build an M-function from the string, and write it to file
    fid = fopen([fname '.m'], 'wt');
    fprintf(fid, 'function v = %s(A,x)\nv = %s;\nend\n', fname, s);
    fclose(fid);

    % rehash path and return a function handle
    rehash
    clear(fname)
    f = str2func(fname);
end

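I tried to optimize the generated function by avoiding exponentiation (we prefer x*x to x^2). I also converted the subscripts to linear indices (A(9) instead of A(3,3)). So for N=3 we get the same equation as yours: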

>> s
s =
A(1)*(x(1)*x(1)) + A(5)*(x(2)*x(2)) + A(9)*(x(3)*x(3)) + 
A(4)*x(1)*x(2) + A(7)*x(1)*x(3) + A(2)*x(1)*x(2) + 
A(8)*x(2)*x(3) + A(3)*x(1)*x(3) + A(6)*x(2)*x(3)
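
As a quick sanity check on the index conversion, here is a minimal sketch that confirms the subscript-to-linear-index mapping and that the N=3 string above agrees with the matrix product. The test data (magic(3) and x = [1;2;3]) is an arbitrary choice of mine, not part of the benchmark.

```matlab
% Sketch: check the linear-index mapping and the N=3 unrolled form.
N = 3;
A = magic(N);          % arbitrary integer test data (assumption)
x = [1; 2; 3];

% MATLAB stores matrices column-major, so A(r,c) is A(r + (c-1)*N)
idx = sub2ind([N N], 3, 3);
assert(idx == 9 && A(idx) == A(3,3));
assert(sub2ind([N N], 1, 2) == 4);

% the generated expression for N=3, copied from the output above
unrolled = A(1)*(x(1)*x(1)) + A(5)*(x(2)*x(2)) + A(9)*(x(3)*x(3)) + ...
           A(4)*x(1)*x(2) + A(7)*x(1)*x(3) + A(2)*x(1)*x(2) + ...
           A(8)*x(2)*x(3) + A(3)*x(1)*x(3) + A(6)*x(2)*x(3);

% matrix-multiplication form
v = x.' * A * x;
assert(abs(unrolled - v) < 1e-12);
```

The off-diagonal pairs such as A(4) and A(2) both multiply x(1)*x(2), which is why each product of two different elements of x appears twice in the expanded string.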

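Given the above method of constructing an M-function, we now evaluate it for various sizes and compare it against the matrix-multiplication form (I put that in a separate function to account for function call overhead). I am using the TIMEIT function instead of tic/toc to get more accurate timings. Also, to make the comparison fair, each method is implemented as an M-file function that receives all the variables it needs as input arguments.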

function results = testMatrixMultVsUnrolled()
    % vector/matrix size
    N_vec = 2:50;
    results = zeros(numel(N_vec),3);
    for ii = 1:numel(N_vec)
        % some random data
        N = N_vec(ii);
        x = rand(N,1); A = rand(N,N);

        % matrix multiplication
        f = @matMult;
        results(ii,1) = timeit(@() feval(f, A,x));

        % unrolled equation
        f = buildUnrolledFunction(N);
        results(ii,2) = timeit(@() feval(f, A,x));

        % check result
        results(ii,3) = norm(matMult(A,x) - f(A,x));
    end

    % display results
    fprintf('N = %2d: mtimes = %.6f ms, unroll = %.6f ms [error = %g]\n', ...
        [N_vec(:) results(:,1:2)*1e3 results(:,3)]')
    plot(N_vec, results(:,1:2)*1e3, 'LineWidth',2)
    xlabel('size (N)'), ylabel('timing [msec]'), grid on
    legend({'mtimes','unrolled'})
    title('Matrix multiplication: $$x^\mathsf{T}Ax$$', ...
        'Interpreter','latex', 'FontSize',14)
end

function v = matMult(A,x)
    v = x.' * A * x;
end
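
For readers unfamiliar with the two timing styles mentioned above, here is a minimal sketch that times the same call with TIMEIT and with a hand-written tic/toc loop. The helper name quadForm and the iteration count are my own choices for illustration and are not part of the benchmark.

```matlab
% Sketch: two ways of timing v = x.'*A*x for a small N.
N = 3;
x = rand(N,1);  A = rand(N,N);
quadForm = @(A,x) x.' * A * x;     % hypothetical helper name

% timeit runs the handle repeatedly and returns a typical
% per-call time in seconds
tTimeit = timeit(@() quadForm(A,x));

% tic/toc: time a manual loop and divide by the iteration count
nIter = 1e5;                       % arbitrary iteration count (assumption)
tic
for k = 1:nIter
    v = quadForm(A,x);
end
tLoop = toc / nIter;

fprintf('timeit: %.3g s per call, tic/toc loop: %.3g s per call\n', ...
    tTimeit, tLoop);
```

Both numbers include the overhead of calling through a function handle, so they are comparable with each other but not with code written inline in the loop body.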

Results:

[Figure: timing plot, close-up]

N =  2: mtimes = 0.008816 ms, unroll = 0.006793 ms [error = 0]
N =  3: mtimes = 0.008957 ms, unroll = 0.007554 ms [error = 0]
N =  4: mtimes = 0.009025 ms, unroll = 0.008261 ms [error = 4.44089e-16]
N =  5: mtimes = 0.009075 ms, unroll = 0.008658 ms [error = 0]
N =  6: mtimes = 0.009003 ms, unroll = 0.008689 ms [error = 8.88178e-16]
N =  7: mtimes = 0.009234 ms, unroll = 0.009087 ms [error = 1.77636e-15]
N =  8: mtimes = 0.008575 ms, unroll = 0.009744 ms [error = 8.88178e-16]
N =  9: mtimes = 0.008601 ms, unroll = 0.011948 ms [error = 0]
N = 10: mtimes = 0.009077 ms, unroll = 0.014052 ms [error = 0]
N = 11: mtimes = 0.009339 ms, unroll = 0.015358 ms [error = 3.55271e-15]
N = 12: mtimes = 0.009271 ms, unroll = 0.018494 ms [error = 3.55271e-15]
N = 13: mtimes = 0.009166 ms, unroll = 0.020238 ms [error = 0]
N = 14: mtimes = 0.009204 ms, unroll = 0.023326 ms [error = 7.10543e-15]
N = 15: mtimes = 0.009396 ms, unroll = 0.024767 ms [error = 3.55271e-15]
N = 16: mtimes = 0.009193 ms, unroll = 0.027294 ms [error = 2.4869e-14]
N = 17: mtimes = 0.009182 ms, unroll = 0.029698 ms [error = 2.13163e-14]
N = 18: mtimes = 0.009330 ms, unroll = 0.033295 ms [error = 7.10543e-15]
N = 19: mtimes = 0.009411 ms, unroll = 0.152308 ms [error = 7.10543e-15]
N = 20: mtimes = 0.009366 ms, unroll = 0.167336 ms [error = 7.10543e-15]
N = 21: mtimes = 0.009335 ms, unroll = 0.183371 ms [error = 0]
N = 22: mtimes = 0.009349 ms, unroll = 0.200859 ms [error = 7.10543e-14]
N = 23: mtimes = 0.009411 ms, unroll = 0.218477 ms [error = 8.52651e-14]
N = 24: mtimes = 0.009307 ms, unroll = 0.235668 ms [error = 4.26326e-14]
N = 25: mtimes = 0.009425 ms, unroll = 0.256491 ms [error = 1.13687e-13]
N = 26: mtimes = 0.009392 ms, unroll = 0.274879 ms [error = 7.10543e-15]
N = 27: mtimes = 0.009515 ms, unroll = 0.296795 ms [error = 2.84217e-14]
N = 28: mtimes = 0.009567 ms, unroll = 0.319032 ms [error = 5.68434e-14]
N = 29: mtimes = 0.009548 ms, unroll = 0.339517 ms [error = 3.12639e-13]
N = 30: mtimes = 0.009617 ms, unroll = 0.361897 ms [error = 1.7053e-13]
N = 31: mtimes = 0.009672 ms, unroll = 0.387270 ms [error = 0]
N = 32: mtimes = 0.009629 ms, unroll = 0.410932 ms [error = 1.42109e-13]
N = 33: mtimes = 0.009605 ms, unroll = 0.434452 ms [error = 1.42109e-13]
N = 34: mtimes = 0.009534 ms, unroll = 0.462961 ms [error = 0]
N = 35: mtimes = 0.009696 ms, unroll = 0.489474 ms [error = 5.68434e-14]
N = 36: mtimes = 0.009691 ms, unroll = 0.512198 ms [error = 8.52651e-14]
N = 37: mtimes = 0.009671 ms, unroll = 0.544485 ms [error = 5.68434e-14]
N = 38: mtimes = 0.009710 ms, unroll = 0.573564 ms [error = 8.52651e-14]
N = 39: mtimes = 0.009946 ms, unroll = 0.604567 ms [error = 3.41061e-13]
N = 40: mtimes = 0.009735 ms, unroll = 0.636640 ms [error = 3.12639e-13]
N = 41: mtimes = 0.009858 ms, unroll = 0.665719 ms [error = 5.40012e-13]
N = 42: mtimes = 0.009876 ms, unroll = 0.697364 ms [error = 0]
N = 43: mtimes = 0.009956 ms, unroll = 0.730506 ms [error = 2.55795e-13]
N = 44: mtimes = 0.009897 ms, unroll = 0.765358 ms [error = 4.26326e-13]
N = 45: mtimes = 0.009991 ms, unroll = 0.800424 ms [error = 0]
N = 46: mtimes = 0.009956 ms, unroll = 0.829717 ms [error = 2.27374e-13]
N = 47: mtimes = 0.010210 ms, unroll = 0.865424 ms [error = 2.84217e-13]
N = 48: mtimes = 0.010022 ms, unroll = 0.907974 ms [error = 3.97904e-13]
N = 49: mtimes = 0.010098 ms, unroll = 0.944536 ms [error = 5.68434e-13]
N = 50: mtimes = 0.010153 ms, unroll = 0.984486 ms [error = 4.54747e-13]

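At small sizes the two methods perform similarly. For N<7 the unrolled version beats mtimes, but the difference is hardly significant. Once we move past tiny sizes, matrix multiplication is orders of magnitude faster (at N=50, about 0.010 ms versus 0.98 ms).

This is not surprising: the formula gets very long, and at just N=20 it involves adding 400 terms. Since the MATLAB language is interpreted, I doubt this is very efficient.

Now, I agree that calling an external function has overhead compared with embedding the code inline, but how practical is such an approach? Even for a size as small as N=20, the generated line is over 7000 characters long! I also noticed that the MATLAB editor became sluggish because of the long lines :)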
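
To get a feel for how quickly the expression grows, here is a rough sketch that builds a naive unrolled string with plain loops, one term per element of A. This is not the generator used above: it needs no Symbolic Math Toolbox and does not simplify anything, so its exact length will differ from the figure quoted for the ccode output.

```matlab
% Sketch: build a naive unrolled expression for x.'*A*x as a string.
N = 20;
terms = cell(N*N, 1);
k = 0;
for c = 1:N
    for r = 1:N
        k = k + 1;     % k is the column-major linear index of A(r,c)
        terms{k} = sprintf('A(%d)*x(%d)*x(%d)', k, r, c);
    end
end
s = strjoin(terms, ' + ');
fprintf('N = %d: %d terms, %d characters\n', N, numel(terms), numel(s));

% sanity check: evaluate the string against the matrix product
x = rand(N,1);  A = rand(N,N);
err = abs(eval(s) - x.'*A*x);
```

The term count is N^2, so doubling N quadruples the length of the line that MATLAB has to parse.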

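Besides, the advantage quickly disappears after about N>10. I compared the embedded, explicitly written code against matrix multiplication, similar to what @DennisJaheruddin suggested. The results: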

N=3:
  Elapsed time is 0.062295 seconds.    % unroll
  Elapsed time is 1.117962 seconds.    % mtimes

N=12:
  Elapsed time is 1.024837 seconds.    % unroll
  Elapsed time is 1.126147 seconds.    % mtimes

N=19:
  Elapsed time is 140.915138 seconds.  % unroll
  Elapsed time is 1.305382 seconds.    % mtimes

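...and it only gets worse for the unrolled version. As I said before, MATLAB is interpreted, so the cost of parsing the code starts to show in such huge files.

The way I see it, after one million iterations we gained about 1 second at best, which I don't think justifies all the trouble and hacks compared with the far more readable and concise v=x'*A*x. So perhaps there are other places in the code that could be improved, rather than focusing on an already optimized operation such as matrix multiplication.

Matrix multiplication in MATLAB is seriously fast (this is what MATLAB does best!). It really shines once the data gets large enough for multithreading to kick in: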

>> N=5000; x=rand(N,1); A=rand(N,N);
>> % note: "for i=1e4" runs the body once (with i = 10000), so this times a single product
>> tic, for i=1e4, v=x.'*A*x; end, toc
Elapsed time is 0.021959 seconds.

score 2 · Accepted Answer

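@Amro has given an extensive answer, and I agree that in general you should not bother writing out explicit calculations and should simply use matrix multiplication everywhere in your code.

However, if your matrices are small enough and you really need to compute something several billion times, the written-out form can be significantly faster (less overhead). The trick is not to put your code in a separate function, because the calling overhead will be far larger than the time spent on the calculation.

Here is a small example: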

x = 1:3;
A = rand(3);
v=0;

unroll = @(x) A(1)*(x(1)*x(1)) + A(5)*(x(2)*x(2)) + A(9)*(x(3)*x(3)) + A(4)*x(1)*x(2) + A(7)*x(1)*x(3) + A(2)*x(1)*x(2) + A(8)*x(2)*x(3) + A(3)*x(1)*x(3) + A(6)*x(2)*x(3); 
regular = @(x) x*A*x'; 

%Written out, no function call
tic
for t = 1:1e6
  v = A(1)*(x(1)*x(1)) + A(5)*(x(2)*x(2)) + A(9)*(x(3)*x(3)) + A(4)*x(1)*x(2) + A(7)*x(1)*x(3) + A(2)*x(1)*x(2) + A(8)*x(2)*x(3) + A(3)*x(1)*x(3) + A(6)*x(2)*x(3);
end
t1=toc;

%Matrix form, no function call
tic
for t = 1:1e6
  v = x*A*x';
end
t2=toc;

%Written out, function call
tic
for t = 1:1e6
  v = unroll(x);
end
t3=toc;

%Matrix form, function call
tic
for t = 1:1e6
  v = regular(x); 
end
t4=toc;

[t1;t2;t3;t4]

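Running this displays the four timings t1 through t4 as a column vector.

So calling the written-out form through an (anonymous) function is not worthwhile, but if you really want the best speed, using the written-out form directly can give you a big speed-up for small matrices.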
