performance - Constructing a fixed-interval dataset from a randomly spaced dataset using stale data

Question

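Update: I have added a brief analysis of the three answers at the bottom of the question text and explained my choice.

My question: What is the most efficient way to construct a fixed-interval dataset from a randomly spaced dataset using stale data?

Some background: The above is a common problem in statistics. Frequently, one has a sequence of observations occurring at random times. Call it Input. But one wants a sequence of observations occurring, say, every 5 minutes. Call it Output. One of the most common methods of constructing this dataset is to use stale data, i.e. to set each observation in Output equal to the most recently occurring observation in Input.

So, here is some code to construct example datasets: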

TInput = 100;
TOutput = 50;

InputTimeStamp = 730486 + cumsum(0.001 * rand(TInput, 1));
Input = [InputTimeStamp, randn(TInput, 1)];

OutputTimeStamp = 730486.002 + (0:0.001:TOutput * 0.001 - 0.001)';
Output = [OutputTimeStamp, NaN(TOutput, 1)];

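Both datasets start close to midnight at the turn of the millennium. However, the timestamps in Input occur at random intervals, while the timestamps in Output occur at fixed intervals. For simplicity, I have ensured that the first observation in Input always occurs before the first observation in Output. Feel free to make this assumption in any answer.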
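To make the target mapping concrete, here is a small hand-checkable sketch. The toy timestamps and values (tIn, vIn, tOut) are made up for illustration and are not datenums. The find-based loop is only meant as a slow reference definition of "most recent observation at or before each output time", not as a proposed solution.

```matlab
% Hypothetical toy data: times in arbitrary units, not datenums
tIn  = [0.0; 1.3; 1.9; 4.2];   % irregular input timestamps
vIn  = [10;  20;  30;  40];    % input observations
tOut = (1:5)';                 % regular output timestamps

vOut = NaN(size(tOut));
for s = 1:numel(tOut)
    % index of the last input observation at or before tOut(s)
    k = find(tIn <= tOut(s), 1, 'last');
    vOut(s) = vIn(k);
end
disp([tOut, vOut])
```

Note that the value 20 never shows up in vOut, because a newer observation (30) arrives before the next output time, and that the last output simply reuses the last input.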

Currently, I solve the problem like this:

sMax = size(Output, 1);
tMax = size(Input, 1);
s = 1;
t = 2;
%#Loop over input data
while t <= tMax
    if Input(t, 1) > Output(s, 1)
        %#If current obs in Input occurs after current obs in output then set current obs in output equal to previous obs in input
        Output(s, 2:end) = Input(t-1, 2:end);
        s = s + 1;
        %#Check if we've filled out all observations in output
        if s > sMax
            break
        end
        %#This step is necessary in case we need to use the same input observation twice in a row
        t = t - 1;
    end
    t = t + 1;
    if t > tMax
        %#If all remaining observations in output occur after last observation in input, then use last obs in input for all remaining obs in output 
        Output(s:end, 2:end) = Input(end, 2:end);
        break
    end
end

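Surely there is a more efficient, or at least more elegant, way to solve this problem? As I mentioned, this is a common problem in statistics. Perhaps Matlab has some built-in function I am not aware of? Any help would be much appreciated, as I use this routine on some large datasets.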
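For comparison, the same two-pointer idea can probably be written more compactly by looping over Output and advancing through Input inside. This is only a sketch on made-up toy data in the same [timestamp, value] layout. It relies on the assumption stated above that the first observation in Input is not later than the first observation in Output.

```matlab
% Hypothetical toy data in the same [timestamp, value] layout as Input/Output
Input  = [0.0 10; 1.3 20; 1.9 30; 4.2 40];
Output = [(1:5)', NaN(5, 1)];

tMax = size(Input, 1);
t = 1;   % assumes Input(1,1) <= Output(1,1)
for s = 1:size(Output, 1)
    % advance t while the next input observation is not later than Output(s,1)
    while t < tMax && Input(t+1, 1) <= Output(s, 1)
        t = t + 1;
    end
    % t now points at the most recent input observation for Output(s,1)
    Output(s, 2:end) = Input(t, 2:end);
end
disp(Output)
```

Because t never moves backwards, each row of Input is visited at most once, and output times after the last input reuse the last input without a special case.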

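ANSWER: Hi all, I have analyzed the three answers, and as it stands, Angainor's is the best.

ChthonicDaemon's answer, while clearly the easiest to implement, is really slow. This is true even when the conversion to a timeseries object is done outside the speed test. I am guessing the resample function has a lot of overhead at the moment. I am running 2011b, so it is possible Mathworks have improved it in the meantime. Also, this method needs an additional line for the case where Output ends more than one observation after Input.

Rody's answer runs only slightly slower than Angainor's (unsurprising, given that they both employ the histc approach); however, it seems to have some problems. First, the method of assigning the last observation in Output is not robust to the last observation in Input occurring after the last observation in Output. This is an easy fix. But there is a second problem, which I think stems from using InputTimeStamp as the first input to histc instead of OutputTimeStamp, as Angainor does. The problem emerges if, when setting up the example inputs, you change OutputTimeStamp = 730486.002 + (0:0.001:TOutput * 0.001 - 0.001)'; to OutputTimeStamp = 730486.002 + (0:0.0001:TOutput * 0.0001 - 0.0001)';

Angainor's answer seems robust to everything I threw at it, and it is also the fastest.

I ran a lot of speed tests for different input specifications. The following numbers are fairly representative: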

My naive loop: Elapsed time is 8.579535 seconds.

Angainor: Elapsed time is 0.661756 seconds.

Rody: Elapsed time is 0.913304 seconds.

ChthonicDaemon: Elapsed time is 22.916844 seconds.

I am +1-ing Angainor's solution and marking the question as solved.

score 2 · Accepted Answer

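This "stale data" approach is known as a zero-order hold in the signal and time series fields. A quick search for that term brings up many solutions. If you have Matlab 2012b, this is all built into the timeseries class through the resample function, so you would simply do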

TInput = 100;
TOutput = 50;

InputTimeStamp = 730486 + cumsum(0.001 * rand(TInput, 1));
InputData = randn(TInput, 1);
InputTimeSeries = timeseries(InputData, InputTimeStamp);

OutputTimeStamp = 730486.002 + (0:0.001:TOutput * 0.001 - 0.001);
OutputTimeSeries = resample(InputTimeSeries, OutputTimeStamp, 'zoh'); % zoh stands for zero order hold
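If the timeseries object is more machinery than needed, a zero-order hold can probably also be sketched on plain vectors with interp1. This is a different technique from resample, and it assumes a MATLAB release whose interp1 accepts the 'previous' method. The toy vectors below are made up for illustration.

```matlab
% Hypothetical toy data: times in arbitrary units, not datenums
tIn  = [0.0; 1.3; 1.9; 4.2];
vIn  = [10;  20;  30;  40];
tOut = (1:5)';

% 'previous' returns the value of the last sample at or before each query point
vOut = interp1(tIn, vIn, tOut, 'previous');

% without extrapolation, query points after the last sample come back as NaN,
% so fill them with the last input observation explicitly
vOut(tOut > tIn(end)) = vIn(end);
disp([tOut, vOut])
```

The explicit fill at the end covers output times that fall after the last input observation.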

score 1 · Accepted Answer

Here is my take on the problem. histc is the way to go:

% find Output timestamps in Input bins
N   = histc(Output(:,1), Input(:,1));

% find counts in the non-empty bins
counts = N(find(N));

% find Input signal value associated with every bin
val = Input(find(N),2);

% now, replicate every entry in val
% as many times as specified in counts
index = zeros(1,sum(counts));
index(cumsum([1 counts(1:end-1)'])) = 1;
index = cumsum(index);
val_rep = val(index);

% finish the signal with last entry from Input, as needed
val_rep(end+1:size(Output,1)) = Input(end,2);

% done
Output(:,2) = val_rep;

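I checked this against your procedure for a few different input models (I changed the number of Output timestamps) and the results were the same. However, I am still not sure I have understood your problem, so if something is wrong here, let me know.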
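To show what each step produces, here is the same sequence traced on a small made-up dataset (tIn, vIn, tOut are hypothetical names and values). At the end there is a variant that uses the second output of histc with an extra Inf edge. That variant is an alternative sketch, not part of the code above, and it assumes the first output time is not earlier than the first input time.

```matlab
% Hypothetical toy data: times in arbitrary units, not datenums
tIn  = [0.0; 1.3; 1.9; 4.2];
vIn  = [10;  20;  30;  40];
tOut = (1:5)';

% Steps of the histc approach, traced on the toy data
N       = histc(tOut, tIn);        % output timestamps per input bin: [1; 0; 3; 0]
counts  = N(N > 0);                % counts in the non-empty bins: [1; 3]
val     = vIn(N > 0);              % input value for each non-empty bin: [10; 30]
index   = zeros(1, sum(counts));
index(cumsum([1 counts(1:end-1)'])) = 1;   % mark the start of each run: [1 1 0 0]
index   = cumsum(index);           % run number for each output: [1 2 2 2]
val_rep = val(index);              % [10; 30; 30; 30]
val_rep(end+1:numel(tOut)) = vIn(end);     % tOut = 5 lies after the last input

% Variant (an alternative, not the code above): the second output of histc
% gives the bin index directly; the extra Inf edge catches times after tIn(end)
[~, bin] = histc(tOut, [tIn; Inf]);        % bin = [1; 3; 3; 3; 4]
vOut = vIn(bin);

disp([tOut, val_rep, vOut])
```

Both columns should agree. The variant skips the run-length expansion, at the cost of relying on the assumption about the first timestamp so that no bin index is zero.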
