I have two arrays. One is a list of the lengths of consecutive segments within the other. For example:

zarray = [1 2 3 4 5 6 7 8 9 10]

lengths = [1 3 2 1 3]

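I want to take the mean of each part of the first array, where the part lengths are given by the second array. For this example, the result would be: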

[mean([1]),mean([2,3,4]),mean([5,6]),mean([7]),mean([8,9,10])]

For speed, I am trying to avoid loops. I tried using mat2cell and cellfun as follows:

zcell = mat2cell(zarray, 1, lengths);
zcellsum = cellfun(@mean, zcell);

But the cellfun part is very slow. Is there a way to do this without loops or cellfun?

2 Answers

Here is a fully vectorized solution (no explicit for-loops, and no hidden loops via ARRAYFUN or CELLFUN). The idea is to use the extremely fast ACCUMARRAY function:

%# data
zarray = [1 2 3 4 5 6 7 8 9 10];
lengths = [1 3 2 1 3];

%# generate subscripts: 1 2 2 2 3 3 4 5 5 5
endLocs = cumsum(lengths(:));
subs = zeros(endLocs(end),1);
subs([1;endLocs(1:end-1)+1]) = 1;
subs = cumsum(subs);

%# mean of each part
means = accumarray(subs, zarray) ./ lengths(:)

The result in this case:

means =
            1
            3
          5.5
            7
            9
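
As a side note, the subscript vector can also be built with repelem instead of the cumsum trick. The following is only a sketch, assuming a MATLAB release that ships repelem (R2015a or later, so newer than the R2012a used for the timings in this answer); it is not part of the original solution and has not been timed against it.

```matlab
% Sketch: build the group subscripts with repelem (assumes R2015a or later)
zarray  = [1 2 3 4 5 6 7 8 9 10];
lengths = [1 3 2 1 3];

% Repeat each group number as many times as its segment is long:
% 1 2 2 2 3 3 4 5 5 5 (as a column vector)
subs = repelem((1:numel(lengths)).', lengths(:));

% Sum of each group divided by the group size gives the per-group mean
means = accumarray(subs, zarray(:)) ./ lengths(:);
```

For the example data this produces the same subscripts as the cumsum-based construction above, so the means should match as well.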

Speed test:

Consider the following comparison of the different methods. I am using the TIMEIT function by Steve Eddins:

function [t,v] = testMeans()
    %# generate test data
    [arr,len] = genData();

    %# define functions
    f1 = @() func1(arr,len);
    f2 = @() func2(arr,len);
    f3 = @() func3(arr,len);
    f4 = @() func4(arr,len);

    %# timeit
    t(1) = timeit( f1 );
    t(2) = timeit( f2 );
    t(3) = timeit( f3 );
    t(4) = timeit( f4 );

    %# return results to check their validity
    v{1} = f1();
    v{2} = f2();
    v{3} = f3();
    v{4} = f4();
end

function [arr,len] = genData()
    %#arr = [1 2 3 4 5 6 7 8 9 10];
    %#len = [1 3 2 1 3];

    numArr = 10000;     %# number of elements in array
    numParts = 500;     %# number of parts/regions      
    arr = rand(1,numArr);
    len = zeros(1,numParts);
    len(1:end-1) = diff(sort( randperm(numArr,numParts) ));
    len(end) = numArr - sum(len);
end

function m = func1(arr, len)
    %# @Drodbar: for-loop
    idx = 1;
    N = length(len);
    m = zeros(1,N);
    for i=1:N
        m(i) = mean( arr(idx+(0:len(i)-1)) );
        idx = idx + len(i);
    end
end

function m = func2(arr, len)
    %# @user1073959: MAT2CELL+CELLFUN
    m = cellfun(@mean, mat2cell(arr, 1, len));
end

function m = func3(arr, len)
    %# @Drodbar: ARRAYFUN+CELLFUN
    idx = arrayfun(@(a,b) a-(0:b-1), cumsum(len), len, 'UniformOutput',false);
    m = cellfun(@(a) mean(arr(a)), idx);
end

function m = func4(arr, len)
    %# @Amro: ACCUMARRAY
    endLocs = cumsum(len(:));
    subs = zeros(endLocs(end),1);
    subs([1;endLocs(1:end-1)+1]) = 1;
    subs = cumsum(subs);

    m = accumarray(subs, arr) ./ len(:);
    if isrow(len)
        m = m';
    end
end

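Here are the timings. The tests were run on a WinXP 32-bit machine with MATLAB R2012a. My method is more than an order of magnitude faster than all the others (roughly 40x faster than the next best). The for-loop and MAT2CELL+CELLFUN versions are effectively tied for second place.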

>> [t,v] = testMeans();
>> t
t =
   0.013098   0.013074   0.022407   0.00031807
    |           |          |          \_________ @Amro: ACCUMARRAY (!)
    |           |           \___________________ @Drodbar: ARRAYFUN+CELLFUN
    |            \______________________________ @user1073959: MAT2CELL+CELLFUN
     \__________________________________________ @Drodbar: FOR-loop

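In addition, all results are correct and equal. The differences are on the order of the machine precision eps (caused by the different ways round-off error accumulates), so they can be treated as noise and ignored: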

%#assert( isequal(v{:}) )
>> maxErr = max(max( diff(vertcat(v{:})) ))
maxErr =
   3.3307e-16
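
Since isequal is too strict for floating-point results, a tolerance-based check is one way to automate this comparison. The sketch below is illustrative only: the small data set and the tolerance value tol are assumptions chosen for the example, not values from the test above. It takes abs of the differences so that negative deviations are counted too.

```matlab
% Sketch: compare two implementations up to a tolerance (example data is hypothetical)
zarray  = [0.1 0.2 0.3 0.4 0.5 0.6];
lengths = [2 1 3];

% Reference result from a plain loop
m1  = zeros(1, numel(lengths));
idx = 1;
for k = 1:numel(lengths)
    m1(k) = mean(zarray(idx:idx+lengths(k)-1));
    idx   = idx + lengths(k);
end

% Same quantity via ACCUMARRAY, as in func4
ends = cumsum(lengths(:));
subs = zeros(ends(end), 1);
subs([1; ends(1:end-1)+1]) = 1;
subs = cumsum(subs);
m2 = (accumarray(subs, zarray(:)) ./ lengths(:)).';

% Largest absolute deviation between the two results
tol = 1e-12;                       % assumed tolerance, pick one suited to your data
maxAbsErr = max(abs(m1 - m2));
assert(maxAbsErr <= tol, 'Results differ by more than the tolerance');
```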
answered 2012-07-29T15:39:13.537

Here is a solution using arrayfun and cellfun:

zarray  = [1 2 3 4 5 6 7 8 9 10];
lengths = [1 3 2 1 3];

% Generate the indexes for the elements contained within each length specified
% subset. idx would be {[1], [4, 3, 2], [6, 5], [7], [10, 9, 8]} in this case
idx = arrayfun(@(a,b) a-(0:b-1), cumsum(lengths), lengths,'UniformOutput',false);
means = cellfun( @(a) mean(zarray(a)), idx);

This gives the output you want:

means =

    1.0000    3.0000    5.5000    7.0000    9.0000

Following @tmpearce's comment, I ran a quick timing comparison of the solution above, which I wrapped in a function called subsetMeans1:

function means = subsetMeans1( zarray, lengths)

% Generate the indexes for the elements contained within each length specified
% subset. idx would be {[1], [4, 3, 2], [6, 5], [7], [10, 9, 8]} in this case
idx = arrayfun(@(a,b) a-(0:b-1), cumsum(lengths), lengths,'UniformOutput',false);
means = cellfun( @(a) mean(zarray(a)), idx);

and a simple for-loop alternative, the function subsetMeans2:

function means = subsetMeans2( zarray, lengths)

% Method based on single loop
idx = 1;
N = length(lengths);
means = zeros( 1, N);
for i = 1:N
    means(i) = mean( zarray(idx+(0:lengths(i)-1)) );
    idx = idx+lengths(i);
end

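I used the following TIMEIT-based test script, which lets you check performance for different numbers of elements in the input vector and different subset sizes: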

% Generate some data for the performance test

% Total number of elements in the vector to test
nVec = 100000;

% Maximum number of elements per subset
nSubset = 5;

% Auxiliary variables for data generation
lengthsGen = randi( nSubset, 1, nVec);
accumLen = cumsum(lengthsGen);
maxIdx = find( accumLen < nVec, 1, 'last' );

% % Original test data
% zarray  = [1 2 3 4 5 6 7 8 9 10];
% lengths = [1 3 2 1 3];

% Vector to test
zarray = 1:nVec;
lengths = [ lengthsGen(1:maxIdx) nVec-accumLen(maxIdx)] ;

% Double-check that the lengths add up to nVec, so nVec is the last index used
assert ( sum(lengths) == nVec)

t1(1) = timeit(@() subsetMeans1( zarray, lengths));
t1(2) = timeit(@() subsetMeans2( zarray, lengths));

fprintf('Time spent subsetMeans1: %f\n',t1(1));
fprintf('Time spent subsetMeans2: %f\n',t1(2));

It turns out that the non-vectorized version without arrayfun and cellfun is faster, probably because of the extra overhead of those functions:

Time spent subsetMeans1: 2.082457
Time spent subsetMeans2: 1.278473
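
For completeness, another loop-free way to get the segment sums is to difference a running total. This is a sketch that does not come from either answer and was not part of the timings above; it assumes row vectors and strictly positive lengths. Because it subtracts large running totals, it may lose some precision on long arrays with large values.

```matlab
% Sketch: per-segment means from differences of a cumulative sum
zarray  = [1 2 3 4 5 6 7 8 9 10];
lengths = [1 3 2 1 3];

c     = cumsum(zarray);        % running total of the data
ends  = cumsum(lengths);       % index of the last element of each segment
sums  = diff([0, c(ends)]);    % segment sums as differences of running totals
means = sums ./ lengths;       % 1 3 5.5 7 9 for this example
```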
answered 2012-07-28T14:40:41.703