如果您有统计工具箱,请尝试以下操作:
a = dlmread('~\downloads\-data-anonfiles-1383150325725.txt'); % read data
p = mvnpdf(a,mean(a),cov(a)); % multivariate PDF of your data
p_sample = numel(p)*p/sum(p); % normalize pdf to number of samples
thresh = 0.5; % set an arbitrary threshold to filter
idx_thresh = p_sample > thresh; % logical indices of samples that meet the threshold
a_filtered = a(idx_thresh,:);
然后使用过滤后的数据再次重复此操作。
p = mvnpdf(a,mean(a_filtered),cov(a_filtered));
p_sample = numel(p)*p/sum(p); % normalize pdf to number of samples
thresh = 0.1; % set an arbitrary threshold to filter
idx_thresh = p_sample > thresh; % logical indices of samples that meet the threshold
a_filtered = a_filtered (idx_thresh,:);
我能够在 2 次迭代中提取出大部分主要分布。但我认为你会想要重复直到 mean(a_filtered) 和 cov(a_filtered) 达到稳定状态值。将它们绘制为迭代的函数,当它们接近一条平线时,您就找到了正确的值。
这相当于使用旋转椭圆进行过滤,但 IMO 它更容易且更有用,因为现在您实际上拥有重现分布所需的 5 个mvnpdf参数(mu_x、mu_y、sigma_xx、sigma_yy、sigma_xy)。如果将等值线 (p(x,y) = thresh) 建模为旋转椭圆,则必须操纵短轴和长轴 (a,b)、平移坐标 (h,k) 和旋转 ( theta) 来获取 mvnpdf 参数。
然后在提取第一个分布后,您可以重复该过程以找到第二个分布。