matlab - Optimizing repeated estimation (currently a loop) in MATLAB

Question

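I find myself needing to do least squares (or a similar matrix-based operation) for every pixel in an image. Each pixel has a set of numbers associated with it, so the data can be arranged as a 3D matrix.

(You can skip this next bit.)

A quick explanation of what I mean by least-squares estimation:

Suppose we have a quadratic system modeled by Y = Ax^2 + Bx + C, and we are looking for the coefficients A, B, C. Given a few samples (at least 3) of X and the corresponding Y, we can estimate them by:

Arranging the (say 10) X samples into a matrix such as X = [x(:).^2 x(:) ones(10,1)];
Arranging the Y samples into a similar matrix: Y = y(:);
Estimating the coefficients A, B, C by solving: coeffs = (X'*X)^(-1)*X'*Y;

Try it yourself if you like: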

A = 5; B = 2; C = 1;
x = 1:10;
y = A*x(:).^2 + B*x(:) + C + .25*randn(10,1); % added some noise here
X = [x(:).^2 x(:) ones(10,1)];
Y = y(:);
coeffs = (X'*X)^-1*X'*Y

coeffs =

  5.0040
  1.9818
  0.9241
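As a side note on the estimate above, forming the inverse of X'*X explicitly is usually slower and less accurate than letting mldivide (the backslash operator) solve the system. The following is only a minimal sketch under the same quadratic model; the fixed seed and the names coeffs_ne and coeffs_bs are assumptions made for illustration.

```matlab
% Sketch: normal-equations solve versus backslash on the same noisy samples
rng(0);                                   % fixed seed so the run is repeatable
x = (1:10).';
y = 5*x.^2 + 2*x + 1 + 0.25*randn(10,1);  % same model as above: A = 5, B = 2, C = 1
X = [x.^2, x, ones(10,1)];                % design matrix, one column per coefficient
coeffs_ne = (X'*X) \ (X'*y);              % normal equations without an explicit inverse
coeffs_bs = X \ y;                        % least-squares solve done directly by mldivide
disp(max(abs(coeffs_ne - coeffs_bs)))     % difference between the two estimates
```

For a small, well-conditioned problem like this one the two estimates should agree closely; the difference grows as X becomes more ill-conditioned.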

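If I lost you there, start paying attention again here.

*MAJOR REWRITE* I have modified this to be as close as possible to the actual problem I have, while still keeping it a minimal working example.

Problem setup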

%// Setup
xdim = 500; 
ydim = 500; 
ncoils = 8; 
nshots = 4; 
%// matrix size for each pixel is ncoils x nshots (an overdetermined system)

%// each pixel has a matrix stored in the 3rd and 4th dimensions
regressor = randn(xdim,ydim, ncoils,nshots); 
regressand = randn(xdim, ydim,ncoils);

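So my problem is that I have to do a (X'*X)^-1*X'*Y (least squares or similar) operation for every pixel in an image. While that operation itself is vectorized/matricized, the only way I have to do it for every pixel is in a for loop, like this:

Original code style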

%// Actual work
tic 
estimate = zeros(xdim,ydim);
for col=1:size(regressor,2)
    for row=1:size(regressor,1)

        X = squeeze(regressor(row,col,:,:));
        Y = squeeze(regressand(row,col,:));

        B = X\Y; 
        % B = (X'*X)^(-1)*X'*Y; %// equivalently

        estimate(row,col) = B(1);
    end
end
toc

Elapsed time = 27.6 seconds

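EDITS in response to comments and other ideas
I tried a few things:
1. Reshaped into a long vector and removed the double for loop. This saved some time.
2. Removed the squeeze (and the in-line transposing) by permute-ing the picture beforehand: this saved a lot more time.

Current example: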

%// Actual work
tic 
estimate2 = zeros(xdim*ydim,1);
regressor_mod = permute(regressor,[3 4 1 2]);
regressor_mod = reshape(regressor_mod,[ncoils,nshots,xdim*ydim]);
regressand_mod = permute(regressand,[3 1 2]);
regressand_mod = reshape(regressand_mod,[ncoils,xdim*ydim]);

for ind=1:size(regressor_mod,3) % for every pixel

    X = regressor_mod(:,:,ind);
    Y = regressand_mod(:,ind);

    B = X\Y;

    estimate2(ind) = B(1);

end
estimate2 = reshape(estimate2,[xdim,ydim]);
toc

Elapsed time = 2.30 seconds (avg of 10)
isequal(estimate2,estimate) == 1;
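The permute/reshape step above relies on the linear pixel index following MATLAB's column-major order. The sketch below checks that mapping on deliberately tiny, hypothetical sizes; the variables row, col, ind and same are illustrative and not part of the original code.

```matlab
% Sketch: page ind of regressor_mod is the same matrix as the squeezed pixel slice
rng(5);
xdim = 3; ydim = 4; ncoils = 5; nshots = 2;      % tiny sizes for illustration only
regressor = randn(xdim, ydim, ncoils, nshots);
regressor_mod = reshape(permute(regressor, [3 4 1 2]), [ncoils, nshots, xdim*ydim]);
row = 2; col = 3;                                % an arbitrary pixel
ind = sub2ind([xdim ydim], row, col);            % column-major linear index of that pixel
same = isequal(regressor_mod(:,:,ind), squeeze(regressor(row,col,:,:)));
disp(same)
```

Because the pages are already contiguous ncoils-by-nshots matrices, the loop body no longer needs squeeze.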

Rody Oldenhuis's way

N  = xdim*ydim*ncoils;  %// number of rows
M  = xdim*ydim*nshots;    %// number of columns

ii = repmat(reshape(1:N,[ncoils,xdim*ydim]),[nshots 1]); %// row indices
jj = repmat(1:M,[ncoils 1]); %// column indices

X = sparse(ii(:),jj(:),regressor_mod(:));
Y = regressand_mod(:);

B = X\Y;

B = reshape(B(1:nshots:end),[xdim ydim]);

Elapsed time = 2.26 seconds (avg of 10) 
            or 2.18 seconds (if you don't include the definition of N,M,ii,jj)

So the question is:
Is there an (even) faster way?

(I don't think so.)

score 6 · Accepted Answer

You can get roughly a 2x speedup by precomputing the transpose of X, i.e.

for x=1:size(picture,2) % second dimension b/c already transposed

    X = picture(:,x);
    XX = X';
    Y = randn(n_timepoints,1);
    %B = (X'*X)^-1*X'*Y; ;
    B = (XX*X)^-1*XX*Y; 
    est(x) = B(1);

end

Before: Elapsed time is 2.520944 seconds.
After: Elapsed time is 1.134081 seconds.

Edit: Your code, as it stands in your latest edit, can be replaced by the following

tic
xdim = 500; 
ydim = 500; 
n_timepoints = 10; % for example

% Actual work
picture = randn(xdim,ydim,n_timepoints);
picture = reshape(picture, [xdim*ydim,n_timepoints])'; % note transpose
YR = randn(n_timepoints,size(picture,2));

% (XX*X).^-1 = sum(picture.*picture).^-1;
% XX*Y = sum(picture.*YR);
est = sum(picture.*picture).^-1 .* sum(picture.*YR);

est = reshape(est,[xdim,ydim]);
toc

Elapsed time is 0.127014 seconds.

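This is an order-of-magnitude speedup over the latest edit, and the results are virtually identical to those of the previous method.

Edit 2:

OK, so if X is a matrix rather than a vector, things are a little more complicated. We basically want to precompute as much as possible outside the for loop to keep the cost down. We can also get a significant speedup by computing XT*X manually: since the result is always a symmetric matrix, we can take a shortcut. First, the symmetric multiplication function: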
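The vectorized expression works because, when each regressor is a single column x, the least-squares estimate reduces to the scalar (x'*y)/(x'*x). A minimal sketch on small, made-up sizes, comparing it with a per-column backslash (the names P, est_vec and est_loop are assumptions for illustration):

```matlab
% Sketch: closed-form single-regressor estimate versus a per-column solve
rng(2);
n_t = 10; npix = 5;                           % hypothetical sizes
P  = randn(n_t, npix);                        % column k is the regressor of pixel k
YR = randn(n_t, npix);                        % column k is the regressand of pixel k
est_vec = sum(P.*YR, 1) ./ sum(P.*P, 1);      % (x'*y)/(x'*x) for every column at once
est_loop = zeros(1, npix);
for k = 1:npix
    est_loop(k) = P(:,k) \ YR(:,k);           % reference: one small solve per pixel
end
disp(max(abs(est_vec - est_loop)))
```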

function XTX = sym_mult(X) % X is a 3-d matrix

n = size(X,2);
XTX = zeros(n,n,size(X,3));
for i=1:n
    for j=i:n
        XTX(i,j,:) = sum(X(:,i,:).*X(:,j,:));
        if i~=j
            XTX(j,i,:) = XTX(i,j,:);
        end
    end
end

Now the actual computation script

xdim = 500; 
ydim = 500; 
n_timepoints = 10; % for example

Y = randn(10,xdim*ydim);
picture = randn(xdim,ydim,n_timepoints); % 500x500x10

% Actual work  
tic  % start timing

picture = reshape(picture, [xdim*ydim,n_timepoints])';

% Here we precompute the (XT*Y) calculation to speed things up later
picture_y = [sum(Y);sum(Y.*picture)]; 

% initialize
est = zeros(size(picture,2),1); 

picture = permute(picture,[1,3,2]);
XTX = cat(2,ones(n_timepoints,1,size(picture,3)),picture);
XTX = sym_mult(XTX); % precompute (XT*X) for speed

X = zeros(2,2); % preallocate for speed
XY = zeros(2,1);

for x=1:size(picture,2) % second dimension b/c already transposed

    %For some reason this is a lot faster than X = XTX(:,:,x);
    X(1,1) = XTX(1,1,x);
    X(2,1) = XTX(2,1,x);
    X(1,2) = XTX(1,2,x);
    X(2,2) = XTX(2,2,x);
    XY(1) = picture_y(1,x);
    XY(2) = picture_y(2,x);

    % Here we utilise the fact that A\B is faster than inv(A)*B
    % We also use the fact that (A*B)*C = A*(B*C) to speed things up
    B = X\XY;
    est(x) = B(1); 
end
est = reshape(est,[xdim,ydim]); 
toc % end timing

Before: Elapsed time is 4.56 seconds.
After: Elapsed time is 2.24 seconds.

This is roughly a 2x speedup. This code should extend to an X of any size you want. For example, in the case where X = [1 x x^2], you would change picture_y to the following

picture_y = [sum(Y);sum(Y.*picture);sum(Y.*picture.^2)];

and change XTX to

XTX = cat(2,ones(n_timepoints,1,size(picture,3)),picture,picture.^2);

You would also change a lot of 2s to 3s in the code, and add XY(3) = picture_y(3,x) to the loop. I believe that should be fairly straightforward.
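For the two-column case X = [1 x] specifically, the 2x2 normal equations can also be solved in closed form for every pixel at once, which removes the loop entirely. This is a different technique from the loop above (Cramer's rule on the precomputed sums), shown only as a sketch; all variable names are hypothetical, and it assumes the determinant is not close to zero for any pixel.

```matlab
% Sketch: closed-form intercept for X = [1 x], computed for all pixels at once
rng(3);
n_t = 10; npix = 6;                           % hypothetical sizes
P = randn(n_t, npix);                         % column k holds x for pixel k
Y = randn(n_t, npix);                         % column k holds y for pixel k
Sx  = sum(P, 1);    Sxx = sum(P.^2, 1);       % entries of X'*X (besides n_t)
Sy  = sum(Y, 1);    Sxy = sum(P.*Y, 1);       % entries of X'*Y
detXTX = n_t*Sxx - Sx.^2;                     % determinant of each 2x2 X'*X
b0 = (Sxx.*Sy - Sx.*Sxy) ./ detXTX;           % first coefficient (the intercept)
b0_loop = zeros(1, npix);
for k = 1:npix
    B = [ones(n_t,1), P(:,k)] \ Y(:,k);       % reference solve for pixel k
    b0_loop(k) = B(1);
end
disp(max(abs(b0 - b0_loop)))
```

The same idea extends to three columns, but the closed-form expressions grow quickly, which is where a batched factorization becomes more practical.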

score 3 · Accepted Answer

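Results

I sped up your original version, since your edit 3 was actually not working (and also does something different).

So, on my PC:

Your (original) version: 8.428473 seconds.
My obfuscated one-liner given below: 0.964589 seconds.

First, for no other reason than to impress, I'll give it as I wrote it: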

%%// Some example data
xdim = 500; 
ydim = 500; 
n_timepoints = 10; % for example
estimate = zeros(xdim,ydim); %// initialization with explicit size

picture = randn(xdim,ydim,n_timepoints);


%%// Your original solution
%// (slightly altered to make my version's results agree with yours)

tic

Y = randn(n_timepoints,xdim*ydim);
ii = 1;
for x = 1:xdim
    for y = 1:ydim

        X = squeeze(picture(x,y,:)); %// or similar creation of X matrix

        B = (X'*X)^(-1)*X' * Y(:,ii);
        ii = ii+1;

        %// sometimes you keep everything and do
        %// estimate(x,y,:) = B(:);
        %// sometimes just the first element is important and you do
        estimate(x,y) = B(1);

    end
end

toc


%%// My version 

tic

%// UNLEASH THE FURY!!
estimate2 = reshape(sparse(1:xdim*ydim*n_timepoints, ...
    builtin('_paren', ones(n_timepoints,1)*(1:xdim*ydim),:), ...
    builtin('_paren', permute(picture, [3 2 1]),:))\Y(:), ydim,xdim).';  %'

toc

%%// Check for equality

max(abs(estimate(:)-estimate2(:)))  % (always less than ~1e-14)

Breakdown

First, here's the version that you should actually use:

%// Construct sparse block-diagonal matrix
%// (Type "help sparse" for more information)
N  = xdim*ydim;      %// number of columns
M  = N*n_timepoints; %// number of rows
ii = 1:M;            %// one row per sample, matching the one-liner above
jj = ones(n_timepoints,1)*(1:N);
s  = permute(picture, [3 2 1]);
X  = sparse(ii,jj(:), s(:));

%// Compute ALL the estimates at once
estimates = X\Y(:);

%// You loop through the *second* dimension first, so to make everything
%// agree, we have to extract elements in the "wrong" order, and transpose:
estimate2 = reshape(estimates, ydim,xdim).';  %'

Here's an example of what picture and the corresponding matrix X look like for xdim = ydim = n_timepoints = 2:

>> clc, picture, full(X)

picture(:,:,1) =
   -0.5643   -2.0504
   -0.1656    0.4497
picture(:,:,2) =
    0.6397    0.7782
    0.5830   -0.3138

ans =
   -0.5643         0         0         0
    0.6397         0         0         0
         0   -2.0504         0         0
         0    0.7782         0         0
         0         0   -0.1656         0
         0         0    0.5830         0
         0         0         0    0.4497
         0         0         0   -0.3138

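You can see why sparse is necessary: the matrix is mostly zeros, and it grows large quickly. The full matrix would quickly consume all your RAM, while the sparse one will not consume much more memory than the original picture matrix does.

With this matrix X, the new problem

X·b = Y

now contains all the problems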
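To make the memory argument concrete, here is a small sketch that builds the same kind of matrix on tiny, made-up sizes and compares the number of stored nonzeros with the number of entries a full matrix would hold. The sizes and the variable n_t are assumptions chosen for illustration.

```matlab
% Sketch: stored nonzeros versus full size for the block-diagonal matrix
rng(4);
xdim = 3; ydim = 2; n_t = 4;                  % deliberately tiny sizes
picture = randn(xdim, ydim, n_t);
N = xdim*ydim;                                % number of columns (one per pixel)
M = N*n_t;                                    % number of rows (one per sample)
jj = ones(n_t,1)*(1:N);                       % column index of every sample
s  = permute(picture, [3 2 1]);               % time runs fastest within each pixel
X  = sparse(1:M, jj(:), s(:));
fprintf('size %dx%d, nnz %d, a full matrix would hold %d entries\n', ...
    size(X,1), size(X,2), nnz(X), numel(X));
```

The nonzero count grows linearly with the number of pixels, while the full size grows quadratically.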

X1 · b1 = Y1
X2 · b2 = Y2
...

where

b = [b1; b2; b3; ...]
Y = [Y1; Y2; Y3; ...]

So, the single command

X\Y

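will solve all your systems at once.

This offloads all the hard work to a set of highly specialized, compiled-to-machine-specific-code, optimized-in-every-way algorithms, rather than the interpreted, generic, always-two-steps-away-from-the-hardware loops in MATLAB.

It should be straightforward to convert this to a version where X is a matrix; you'll end up with something like what blkdiag does, which can also be used by mldivide in exactly the same way as above.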
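The matrix-valued case mentioned above could look roughly like the following sketch, which stacks a few small overdetermined systems with blkdiag and solves them in one call. The sizes and the names A, y, blocks, Xbig, b_all and b_loop are hypothetical, and for a full-size image building the index vectors for sparse directly is likely to be cheaper than creating one cell per pixel.

```matlab
% Sketch: several small least-squares systems solved through one block-diagonal matrix
rng(1);
nblk = 3; m = 4; n = 2;                       % 3 systems, each m-by-n (made-up sizes)
A = randn(m, n, nblk);                        % one regressor matrix per system
y = randn(m, nblk);                           % one right-hand side per system
blocks = cell(1, nblk);
for k = 1:nblk
    blocks{k} = sparse(A(:,:,k));             % sparse blocks keep the result sparse
end
Xbig  = blkdiag(blocks{:});                   % (nblk*m)-by-(nblk*n) block-diagonal matrix
b_all = reshape(Xbig \ y(:), n, nblk);        % column k is the solution of system k
b_loop = zeros(n, nblk);
for k = 1:nblk
    b_loop(:,k) = A(:,:,k) \ y(:,k);          % reference: solve each system separately
end
disp(max(abs(b_all(:) - b_loop(:))))
```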

score 3 · Accepted Answer

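I had an idea, and I decided to put it up as a separate answer, since it is a completely different approach from my other idea, and I don't actually condone what I'm about to do. I think this is the fastest approach so far:

Original (unoptimized): 13.507176 seconds.
Fast Cholesky decomposition method: 0.424464 seconds.

First, we have a function that quickly does the X'*X multiplication. We can speed things up here because the result is always symmetric.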

function XX = sym_mult(X)

n = size(X,2);
XX = zeros(n,n,size(X,3));
for i=1:n
    for j=i:n
        XX(i,j,:) = sum(X(:,i,:).*X(:,j,:));
        if i~=j
            XX(j,i,:) = XX(i,j,:);
        end
    end
end

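Next, we have a function that does the LDL Cholesky decomposition on a 3D matrix (we can do this because the (X'*X) matrix is always symmetric), and then does forward and backward substitution to solve the resulting LDL system.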

function Y = fast_chol(X,XY)

n=size(X,2);
L = zeros(n,n,size(X,3));
D = zeros(n,n,size(X,3));
B = zeros(n,1,size(X,3));
Y = zeros(n,1,size(X,3));
% These loops compute the LDL decomposition of the 3D matrix
for i=1:n
    D(i,i,:) = X(i,i,:);
    L(i,i,:) = 1;
    for j=1:i-1
        L(i,j,:) = X(i,j,:);
        for k=1:(j-1)
            L(i,j,:) = L(i,j,:) - L(i,k,:).*L(j,k,:).*D(k,k,:);
        end
        D(i,j,:) = L(i,j,:);
        L(i,j,:) = L(i,j,:)./D(j,j,:);
        if i~=j
            D(i,i,:) = D(i,i,:) - L(i,j,:).^2.*D(j,j,:);
        end
    end
end

for i=1:n
    B(i,1,:) = XY(i,:);
    for j=1:(i-1)
        B(i,1,:) = B(i,1,:)-D(i,j,:).*B(j,1,:);
    end
    B(i,1,:) = B(i,1,:)./D(i,i,:);
end

for i=n:-1:1
    Y(i,1,:) = B(i,1,:);
    for j=n:-1:(i+1)
        Y(i,1,:) = Y(i,1,:)-L(j,i,:).*Y(j,1,:);
    end
end

Finally, we have the main script that calls all of this

xdim = 500; 
ydim = 500; 
n_timepoints = 10; % for example

Y = randn(10,xdim*ydim);
picture = randn(xdim,ydim,n_timepoints); % 500x500x10

tic  % start timing

picture = reshape(picture, [xdim*ydim,n_timepoints])';
% Here we precompute the (XT*Y) calculation
picture_y = [sum(Y);sum(Y.*picture)];

% initialize
est2 = zeros(size(picture,2),1); 

picture = permute(picture,[1,3,2]);
% Now we calculate the X'*X matrix
XTX = cat(2,ones(n_timepoints,1,size(picture,3)),picture);
XTX = sym_mult(XTX);

% Call our fast Cholesky decomposition routine
B = fast_chol(XTX,picture_y);
est2 = B(1,:);

est2 = reshape(est2,[xdim,ydim]); 
toc

Again, this should work equally well for an Nx3 X matrix, or however big you want.

score 0 · Accepted Answer

I use Octave, so I can't say anything about the performance in MATLAB, but I expect this code to be slightly faster:

pictureT=picture'
est=arrayfun(@(x)( (pictureT(x,:)*picture(:,x))^-1*pictureT(x,:)*randn(n_timepoints,1)),1:size(picture,2));
