performance - 使用 bsxfun 和 GPU 在 matlab 中矢量化嵌套循环

Question

For 循环似乎非常慢，所以我想知道接下来显示的代码中的嵌套循环是否可以使用矢量化bsxfun，也许也可以引入 GPU。

代码

%// Paramaters
i = 1;
j = 3;
n1 = 1500;
n2 = 1500;

%// Pre-allocate for output
LInc(n1+n2,n1+n2)=0;

%// Nested Loops - I 
for x = 1:n1
    for y = 1:n1
        num = ((n2 ^ 2) * (L1(i, i) + L2(j, j) + 1)) - (n2 * n * (L1(x,i) + L1(y,i)));
        LInc(x, y) = L1(x, y) + (num/denom);
        LInc(y, x) = LInc(x, y);
    end
end

%// Nested Loops - II
for x = 1:n1
    for y = 1:n2
        num = (n1 * n * L1(x,i)) + (n2 * n * L2(y,j)) - ((n1 * n2 * (L1(i, i) + L2(j, j) + 1)));
        LInc(x, n1+y) = num/denom;
        LInc(n1+y, x) = LInc(x, n1+y);
    end
end

编辑1： n也denom可以假设为常量。

score 8 · Accepted Answer

这里是矢量化CPU和GPU代码，我希望我至少在GPU代码和以后的基准测试中使用良好的实践。

中央处理器代码

%// Pre-allocate for output
LInc(n1+n2,n1+n2)=0;

%// Calculate num/denom value for stage 1 and 2
nd1 = L1 + (((n2 ^ 2) * (L1(i, i) + L2(j, j) + 1)) - n2*n*bsxfun(@plus,L1(:,i),L1(:,i).'))./denom; %//'
nd2 = (bsxfun(@plus,n1*n*L1(:,i),n2*n*L2(:,j).') - ((n1 * n2 * (L1(i, i) + L2(j, j) + 1))))./denom; %//'

%// Plug in the values in the output matrix
LInc(1:n1,1:n1) = tril(nd1) + tril(nd1,-1).'; %//'
LInc(n1+1:end,1:n1) = nd2.';  %//'
LInc(1:n1,n1+1:end) = nd2;

GPU 代码

%// Pre-allocate for output
gLInc = zeros(n1+n2,n1+n2,'gpuArray');

%// Convert to gpu arrays
gL1 = gpuArray(L1);
gL2 = gpuArray(L2);

%// Calculate num/denom value for stage 1 and 2
nd1 = gL1 + (((n2 ^ 2) * (gL1(i, i) + gL2(j, j) + 1)) - n2*n*bsxfun(@plus,gL1(:,i),gL1(:,i).'))./denom; %//'
nd2 = (bsxfun(@plus,n1*n*gL1(:,i),n2*n*gL2(:,j).') - ((n1 * n2 * (gL1(i, i) + gL2(j, j) + 1))))./denom; %//'

%// Plug in the values in the output matrix
gLInc(1:n1,1:n1) = tril(nd1) + tril(nd1,-1).'; %//'
gLInc(n1+1:end,1:n1) = nd2.';  %//'
gLInc(1:n1,n1+1:end) = nd2;

%// Gather data from GPU back to CPU
LInc = gather(gLInc);

基准测试

GPU 基准测试技巧取自Measure and Improvement GPU Performance。

%// Warm up GPU call with insignificant small scalar inputs, just in case
%// gputimeit doesn't do the same
temp1 = modp2(1,1,1,1,1,1,1,1); %// This is vectorized GPU code

i = 1;
j = 3;
n = 1000; %// Assumed
denom = 1e6;  %// Assumed

N_arr = [50 100 200 500 1000 1500]; %// array elements for N (datasize)
timeall = zeros(3,numel(N_arr));

for k1 = 1:numel(N_arr)
    N = N_arr(k1);
    n1 = N;  %// n1, n2 are assumed identical for less-complicated benchmarking
    n2 = N;

    L1 = rand(n1,n1);
    L2 = rand(n2,j);

    f = @() modp0(i,j,n1,n2,L1,L2,n,denom);%// Original CPU w/ preallocation
    timeall(1,k1) = timeit(f);
    clear f

    f = @() modp1(i,j,n1,n2,L1,L2,n,denom);%// Vectorzied CPU code
    timeall(2,k1) = timeit(f);
    clear f

    f = @() modp2(i,j,n1,n2,L1,L2,n,denom);%// Vectorized GPU(GTX 750Ti) code
    timeall(3,k1) = gputimeit(f);
    clear f
end

%// Display benchmark results
figure,hold on, grid on
plot(N_arr,timeall(1,:),'-b.')
plot(N_arr,timeall(2,:),'-ro')
plot(N_arr,timeall(3,:),'-kx')
legend('Original CPU','Vectorized CPU','Vectorized GPU (GTX 750 Ti)')
xlabel('Datasize (N) ->'),ylabel('Time(sec) ->')

结果

在此处输入图像描述

结论

结果表明，矢量化 GPU 代码在数据量更大的情况下表现得非常好，并且从比矢量化 CPU 和原始代码都慢到比矢量化 CPU 代码快两倍。

score 0 · Accepted Answer

如果您还没有这样做，您应该预先分配 LInc。

LInc = zeros(n1,n2);

如果你想对它进行矢量化，你不需要使用 bsxfun 来对你的代码进行矢量化。我认为你可以做类似的事情

x = 1:n1;
y = 1:n1;
num = ((n2 ^ 2) * (L1(i, i) + L2(j, j) + 1)) - (n2 * n * (L1(x,i) + L1(y,i)));
                    LInc(x, y) = L1(x, y) + (num/denom);

但是，这段代码让我感到困惑，因为实际上，您将多次覆盖 LInc 的值。在不知道您的目标是什么的情况下，我很难提供更多帮助。上面的代码可能不会返回与您的函数相同的值。

performance - 使用 bsxfun 和 GPU 在 matlab 中矢量化嵌套循环

2 回答 2

中央处理器代码

GPU 代码

基准测试

结果

结论

Related

Reference