performance - Matlab中的内存和执行速度

Question

我正在尝试创建随机线条并选择其中一些，这非常罕见。我的代码相当简单，但要获得我可以使用的东西，我需要创建非常大的向量（即：<100000000 x 1，在我的代码中跟踪变量）。有什么方法可以创建更大的向量并减少所有这些计算所需的时间？

我的代码是

%Initial line values

tracks=input('Give me the number of muon tracks: ');
width=1e-4;
height=2e-4;

Ystart=15.*ones(tracks,1);
Xstart=-40+80.*rand(tracks,1);
%Xend=-40+80.*rand(tracks,1);
Xend=laprnd(tracks,1,Xstart,15);
X=[Xstart';Xend'];
Y=[Ystart';zeros(1,tracks)];
b=(Ystart.*Xend)./(Xend-Xstart);
hot=0;
cold=0;

for i=1:tracks
    if ((Xend(i,1)<width/2 && Xend(i,1)>-width/2)||(b(i,1)<height && b(i,1)>0))
        plot(X(:, i),Y(:, i),'r');%the chosen ones!
        hold all
        hot=hot+1;
    else
        %plot(X(:, i),Y(:, i),'b');%the rest of them
        %hold all
        cold=cold+1;
    end
end

我也在使用并调用一个拉普拉斯分布生成器制作了我的 Elvis Chen，可以在这里找到

function y  = laprnd(m, n, mu, sigma)
%LAPRND generate i.i.d. laplacian random number drawn from laplacian distribution
%   with mean mu and standard deviation sigma. 
%   mu      : mean
%   sigma   : standard deviation
%   [m, n]  : the dimension of y.
%   Default mu = 0, sigma = 1. 
%   For more information, refer to
%   http://en.wikipedia.org./wiki/Laplace_distribution

%   Author  : Elvis Chen (bee33@sjtu.edu.cn)
%   Date    : 01/19/07

%Check inputs
if nargin < 2
    error('At least two inputs are required');
end

if nargin == 2
    mu = 0; sigma = 1;
end

if nargin == 3
    sigma = 1;
end

% Generate Laplacian noise
u = rand(m, n)-0.5;
b = sigma / sqrt(2);
y = mu - b * sign(u).* log(1- 2* abs(u));

结果图是

score 3 · Accepted Answer

正如您所指出的，您的问题有两个方面。一方面，你有记忆问题，因为你需要做很多试验。另一方面，您有性能问题，因为您必须处理所有这些试验。

每个问题的解决方案通常会对另一个问题产生负面影响。恕我直言，最好的方法是找到妥协。

只有摆脱矢量化所需的那些庞大的数组，并使用不同的策略来执行循环，才能进行更多试验。我将优先考虑使用更多试验的可能性，可能以最佳性能为代价。

当我在Matlab profiler中按原样执行您的代码时，它立即显示所有变量的初始内存分配需要大量时间。它还表明plotandhold all命令是它们中最耗时的行。更多的反复试验表明，在错误开始出现trials之前，您可以做的最大值非常低，令人失望。OUT OF MEMORY

如果你知道一些关于它在 Matlab 中的局限性的事情，那么循环可以极大地加速。在旧版本的 Matlab 中，应该完全避免循环以支持“矢量化”代码，这是事实。在最近的版本中（我相信是 R2008a 及更高版本），Mathworks 引入了一种称为 JIT 加速器（即时编译器）的技术，它在执行过程中将 M 代码即时翻译成机器语言。简单地说，JIT 加速器可以让你的代码绕过 Matlab 的解释器，更直接地与底层硬件对话，从而节省大量时间。

您会听到很多关于在 Matlab 中应避免循环的建议，但通常不再是正确的。尽管矢量化仍然有其价值，但仅使用矢量化代码实现的任何相当复杂的过程通常都难以辨认、难以理解、难以更改和难以维护。使用循环的相同过程的实现通常没有这些缺点，而且它通常会更快并且需要更少的内存。

不幸的是，JIT 加速器有一些令人讨厌的（恕我直言，不必要的）限制，您必须了解这些限制。

其中一件事是plot；让循环除了收集和操作数据之外什么都不做，并将任何绘图命令等延迟到循环之后，通常是一个更好的主意。

另一个这样的事情是hold；该hold函数不是Matlab内置函数，意思是，它是用M语言实现的。Matlab 的 JIT 加速器在循环中使用时无法加速非内置函数，这意味着您的整个循环将以 Matlab 的解释速度运行，而不是机器语言的速度！因此，也将此命令延迟到循环之后:)

现在，如果您想知道，这最后一步可能会产生巨大的影响——我知道有一种情况，将函数体复制粘贴到上层循环中会导致性能提高 1200 倍。执行时间已减少到几分钟！）。

实际上，您的循环中还有另一个小问题（它非常小，而且相当不方便，我会立即同意） - 循环变量的名称不应该是i. 名称i是Matlab中虚数单位的名称，名称解析也会在每次迭代中不必要地消耗时间。它很小，但不容忽视。

现在，考虑到这一切，我得出了以下实现：

function [hot, cold, h] = MuonTracks(tracks)

    % NOTE: no variables larger than 1x1 are initialized

    width  = 1e-4;
    height = 2e-4;

    % constant used for Laplacian noise distribution  
    bL = 15 / sqrt(2);

    % Loop through all tracks
    X = [];
    hot = 0;  
    ii = 0;          
    while ii <= tracks

        ii = ii + 1;

        % Note that I've inlined (== copy-pasted) the original laprnd()
        % function call. This was necessary to work around limitations 
        % in loops in Matlab, and prevent the nececessity of those HUGE 
        % variables. 
        %
        % Of course, you can still easily generalize all of this: 

        % the new data
        u = rand-0.5;

        Ystart = 15; 
        Xstart = 800*rand-400;
        Xend   = Xstart - bL*sign(u)*log(1-2*abs(u));

        b = (Ystart*Xend)/(Xend-Xstart);


        % the test
        if ((b < height && b > 0)) ||...
            (Xend < width/2 && Xend > -width/2)

            hot = hot+1;

            % growing an array is perfectly fine when the chances of it
            % happening are so slim
            X = [X [Xstart; Xend]]; %#ok

        end
    end

    % This is trivial to do here, and prevents an 'else' in the loop
    cold = tracks - hot;

    % Now plot the chosen ones
    h = figure;
    hold all
    Y = repmat([15;0], 1, size(X,2));
    plot(X, Y, 'r'); 

end

有了这个实现，我可以这样做：

>> tic, MuonTracks(1e8); toc
Elapsed time is 24.738725 seconds.

内存占用完全可以忽略不计。

分析器现在还显示了代码中工作量的均匀分布；没有因为内存使用或性能而真正脱颖而出的行。

这可能不是最快的实现（如果有人看到明显的改进，请随时编辑它们）。但是，如果您愿意等待，您将能够做到MuonTracks(1e23)（或更高:)

我还在 C 中做了一个实现，可以编译成 Matlab MEX 文件：

/* DoMuonCounting.c */

#include <math.h>
#include <matrix.h>
#include <mex.h>
#include <time.h>
#include <stdlib.h>


void CountMuons(
    unsigned long long tracks,
    unsigned long long *hot, unsigned long long *cold, double *Xout);


/* simple little helper functions */
double sign(double x) { return (x>0)-(x<0); }
double rand_double()  { return (double)rand()/(double)RAND_MAX; }


/* the gateway function */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    int
        dims[] = {1,1};

    const mxArray
        /* Output arguments */
        *hot_out  = plhs[0] = mxCreateNumericArray(2,dims, mxUINT64_CLASS,0),
        *cold_out = plhs[1] = mxCreateNumericArray(2,dims, mxUINT64_CLASS,0),
        *X_out    = plhs[2] = mxCreateDoubleMatrix(2,10000, mxREAL);

    const unsigned long long
        tracks = (const unsigned long long)mxGetPr(prhs[0])[0];

    unsigned long long
        *hot  = (unsigned long long*)mxGetPr(hot_out),
        *cold = (unsigned long long*)mxGetPr(cold_out);
    double
        *Xout = mxGetPr(X_out);

    /* call the actual function, and return */
    CountMuons(tracks, hot,cold, Xout);
}


// The actual muon counting
void CountMuons(
    unsigned long long tracks,
    unsigned long long *hot, unsigned long long *cold, double *Xout)
{
    const double
        width  = 1.0e-4,
        height = 2.0e-4,
        bL     = 15.0/sqrt(2.0),
        Ystart = 15.0;

    double
        Xstart,
        Xend,
        u,
        b;
    unsigned long long
        i = 0ul;


    *hot  = 0ul;
    *cold = tracks;

    /* seed the RNG */
    srand((unsigned)time(NULL));

    /* aaaand start! */
    while (i++ < tracks)
    {
        u = rand_double() - 0.5;

        Xstart = 800.0*rand_double() - 400.0;
        Xend   = Xstart - bL*sign(u)*log(1.0-2.0*fabs(u));

        b = (Ystart*Xend)/(Xend-Xstart);

        if ((b < height && b > 0.0) || (Xend < width/2.0 && Xend > -width/2.0))
        {
            Xout[0 + *hot*2] = Xstart;
            Xout[1 + *hot*2] = Xend;
            ++(*hot);
            --(*cold);
        }
    }
}

在 Matlab 中编译

mex DoMuonCounting.c

（在运行之后mex setup:)，然后将它与一个小的 M-wrapper 结合使用，如下所示：

function [hot,cold, h] = MuonTrack2(tracks)

    % call the MEX function 
    [hot,cold, Xtmp] = DoMuonCounting(tracks);

    % process outputs, and generate plots

    hot = uint32(hot); % circumvents limitations in 32-bit matlab

    X = Xtmp(:,1:hot);
    clear Xtmp

    h = NaN;
    if ~isempty(X)
        h = figure;
        hold all         
        Y = repmat([15;0], 1, hot);
        plot(X, Y, 'r');       
    end    
end

这让我可以做

>> tic, MuonTrack2(1e8); toc
Elapsed time is 14.496355 seconds.

请注意，MEX 版本的内存占用略大，但我认为这没什么好担心的。

我看到的唯一缺陷是固定的最大 Muon 计数（硬编码为 10000 作为初始数组大小Xout；需要，因为标准 C 中没有动态增长的数组）......如果你担心这个限制可能是坏了，只需增加它，将其更改为等于的一小部分tracks，或者做一些更聪明（但更痛苦）的动态数组增长技巧。

score 2 · Accepted Answer

在 Matlab 中，向量化有时比使用for循环更快。例如，这个表达式：

(Xend(i,1) < width/2 && Xend(i,1) > -width/2) || (b(i,1) < height && b(i,1) > 0)

它是为的每个值定义的i，可以像这样以矢量化方式重写：

isChosen = (Xend(:,1) < width/2 & Xend(:,1) > -width/2) | (b(:,1) < height & b(:,1)>0)

像这样的表达式Xend(:,1)会给你一个列向量，所以Xend(:,1) < width/2会给你一个布尔值的列向量。请注意，我使用&了而不是&&- 这是因为&执行逐元素逻辑与，不像&&它只适用于标量值。通过这种方式，您可以构建整个表达式，以便该变量包含一个布尔值的列向量，一个用于/向量isChosen的每一行。Xendb

获取计数现在就像这样简单：

hot = sum(isChosen);

因为true由表示1。和：

cold = sum(~isChosen);

最后，您可以通过使用布尔向量选择行来获取数据点：

plot(X(:, isChosen),Y(:, isChosen),'r');    % Plot chosen values
hold all;
plot(X(:, ~isChosen),Y(:, ~isChosen),'b');  % Plot unchosen values

编辑：代码应如下所示：

isChosen = (Xend(:,1) < width/2 & Xend(:,1) > -width/2) | (b(:,1) < height & b(:,1)>0);
hot = sum(isChosen);
cold = sum(~isChosen);
plot(X(:, isChosen),Y(:, isChosen),'r');    % Plot chosen values

performance - Matlab中的内存和执行速度

2 回答 2

Related

Reference