matlab - Faster grpstats on a large dataset

Question

I have a large Matlab dataset (1,924,014 x 5; ~73.4 MB)

Date          id            a           b           c
...
733234        1467          1.2656      1.2718      51.16    
733235        1467          1.2732      1.2794      51.16    
733236        1467          1.2781      1.2844       51.5    
733236        1467            1.26         NaN        NaN    
733237        1467          1.3084         NaN        NaN    
733237        1467          1.3205         NaN        NaN    
733238        1467          1.3125      1.3188      53.85    
733238        1467             1.3         NaN        NaN    
...

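Date is the date in datenum format.
I need to average (ignoring NaNs) the last three columns over each unique Date+id pair, because a given Date+id pair sometimes has more than one row.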

My desired output is

Date          id            mean_a      mean_b      mean_c
...
733234        1467          1.2656      1.2718      51.16    
733235        1467          1.2732      1.2794      51.16    
733236        1467          1.2691      1.2844       51.5    
733237        1467          1.3144         NaN        NaN    
733238        1467          1.3062      1.3188      53.85    
...

I was hoping to be able to use

grpstats(myDataset, {'Date', 'id'}, 'mean')

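but it is prohibitively slow. I would expect this task to finish in under 60 seconds. I think grpstats is adding a GroupCount column and a name for each observation, neither of which I need.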

How can I do this quickly? The solution does not have to use grpstats.

score 4 · Accepted Answer

Group by Date and id with unique(...,'rows'), build the two-column subs with meshgrid() (or explicitly with repmat()), and then accumulate with accumarray(), applying @nanmean to each group:

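% db is assumed to hold the data as a plain numeric matrix, e.g. double(myDataset)
% Group by date and id: pos maps each row of db to its row in un
[un,~,pos] = unique(db(:,1:2),'rows');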

% Produce row, col subs 
[col,row] = meshgrid(1:3,pos);

% Accumulate 
[un accumarray([row(:), col(:)], reshape(db(:,3:5),[],1),[],@nanmean)]
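
If nanmean is not available, or if calling a function handle for every group turns out to be a bottleneck, one possible variant is to let accumarray do its default summation twice: once for the sums of the non-NaN values and once for their counts. The following is only a sketch on the eight sample rows from the question; it assumes the data is already a numeric matrix db, and the names vals, valid, sums and cnts are illustrative, not part of the answer above.

```matlab
% Sample rows from the question as a plain numeric matrix
db = [733234 1467 1.2656 1.2718 51.16
      733235 1467 1.2732 1.2794 51.16
      733236 1467 1.2781 1.2844 51.5
      733236 1467 1.26   NaN    NaN
      733237 1467 1.3084 NaN    NaN
      733237 1467 1.3205 NaN    NaN
      733238 1467 1.3125 1.3188 53.85
      733238 1467 1.3    NaN    NaN];

% Group index: pos(i) is the row of un that db(i,1:2) maps to
[un,~,pos] = unique(db(:,1:2),'rows');

vals  = db(:,3:5);
valid = ~isnan(vals);      % entries that should count towards the mean
vals(~valid) = 0;          % NaNs then contribute nothing to the sums

% Row/column subscripts for every element of vals, in column-major order
nCols = size(vals,2);
row = repmat(pos(:),nCols,1);
col = kron((1:nCols)',ones(numel(pos),1));

% Default accumarray behaviour is summation, so no function handle is needed
sz   = [size(un,1) nCols];
sums = accumarray([row col], vals(:), sz);
cnts = accumarray([row col], double(valid(:)), sz);

% 0/0 evaluates to NaN, so groups with no valid values come out as NaN
result = [un sums./cnts];
```

On the sample rows this produces one row per Date+id pair (five rows), with NaN in the last two columns for date 733237, where every b and c value is missing. Whether it actually beats the @nanmean version on the full 1.9 million rows would need to be timed.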
