matlab - Better (non-linear) binning

Question

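The last question I asked was about how to bin data by x coordinate. The solution was simple and elegant, and I'm ashamed I didn't see it. This question may be harder (or I may just be blind).

I started with about 140000 data points and split them into 70 groups spaced equally along the x axis, then took the average position (x_avg, y_avg) of each group and plotted them; a nice curve appeared. Unfortunately there are two problems. First, the edges are much less populated than the center of the graph; second, some regions change more than others and therefore need finer resolution.

So I have two specific questions and a general invitation for suggestions:

Does matlab have a built-in way of splitting a matrix into a fixed number of smaller matrices, or into smaller matrices of a fixed size?

Is there an algorithm (or a matlab function, but I find that unlikely) to determine the boundaries required to bin the regions of interest more finely?

More generally, is there a better way to condense tens of thousands of data points into a neat trend?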

score 2 · Accepted Answer

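It sounds like you want to use bins that vary in size depending on the density of the x values. I think you can still use the function HISTC as in the answer to your previous post, but you would just have to give it a different set of edges.

I don't know if this is exactly what you want, but here's one suggestion: instead of splitting the x axis into 70 equally spaced groups, split the sorted x data into 70 equally sized groups and determine the edge values. I think this code should work: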

% Start by assuming x and y are row vectors of data:

nBins = 70;
nValues = length(x);
[xsort,index] = sort(x);  % Sort x in ascending order
ysort = y(index);         % Sort y the same way as x
binEdges = [xsort(1:ceil(nValues/nBins):nValues) xsort(nValues)+1];
nBins = length(binEdges)-1;  % Actual bin count (can fall below 70 when nValues/nBins is not an integer)

% Bin the data and get the averages as in previous post (using ysort instead of y):

[h,whichBin] = histc(xsort,binEdges);

binMean = zeros(1,nBins);  % Preallocate the result
for i = 1:nBins
    flagBinMembers = (whichBin == i);   % Logical mask of points in bin i
    binMembers = ysort(flagBinMembers);
    binMean(i) = mean(binMembers);
end

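This should give you bins that vary in size with the data density.

UPDATE: Another version...

Here's another idea I came up with after a few comments. With this code, you set a threshold (maxDelta) for the difference between neighboring data points in x. Any x value that exceeds its smaller neighbor by maxDelta or more starts a new bin, so widely spread points are forced into bins of their own (all by their lonesome). You still choose a value for nBins, but the final number of bins will be larger than this value when spread-out points are relegated to their own bins.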
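As a rough, self-contained sketch of the equal-count binning above: the synthetic x and y, and the names step, counts, xAvg, yAvg, and yAvgReshape, are assumptions made for illustration and are not part of the original post. It replaces the loop with ACCUMARRAY and also averages x, so each bin yields an (x_avg, y_avg) pair.

```matlab
% Synthetic data (assumption): x is dense near 0 and sparse toward +/-10.
nValues = 1400;
x = 10 * linspace(-1, 1, nValues).^3;   % row vector with uneven density
y = sin(x);                             % row vector, same length as x

nBins = 70;
[xsort, index] = sort(x);               % sort x in ascending order
ysort = y(index);                       % reorder y the same way

% One edge every 'step' sorted points, plus a closing edge above the maximum.
step = ceil(nValues / nBins);
binEdges = [xsort(1:step:nValues), xsort(nValues) + 1];
nBins = numel(binEdges) - 1;            % number of bins actually produced

% counts(k) is the population of bin k; whichBin holds each point's bin index.
[counts, whichBin] = histc(xsort, binEdges);

% Per-bin averages of x and y without an explicit loop.
xAvg = accumarray(whichBin(:), xsort(:), [nBins 1], @mean);
yAvg = accumarray(whichBin(:), ysort(:), [nBins 1], @mean);

% Cross-check: when nValues is an exact multiple of nBins, reshaping the
% sorted data into step-by-nBins columns gives the same averages.
yAvgReshape = mean(reshape(ysort, step, nBins), 1).';
```

Plotting the result, for example with plot(xAvg, yAvg, 'o-'), should then trace the trend with the same number of points behind every marker. The RESHAPE line only applies when nValues divides evenly by nBins; otherwise the ACCUMARRAY form is the one to keep.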

% Start by assuming x and y are row vectors of data:

maxDelta = 10; % Or whatever suits your data set!
nBins = 70;
nValues = length(x);
[xsort,index] = sort(x);  % Sort x in ascending order
ysort = y(index);         % Sort y the same way as x

% Create bin edges:

edgeIndex = false(1,nValues);
edgeIndex(1:ceil(nValues/nBins):nValues) = true;
edgeIndex = edgeIndex | ([0 diff(xsort)] >= maxDelta);
nBins = sum(edgeIndex);
binEdges = [xsort(edgeIndex) xsort(nValues)+1];

% Bin the data and get the y averages:

[h,whichBin] = histc(xsort,binEdges);

binMean = zeros(1,nBins);  % Preallocate the result
for i = 1:nBins
    flagBinMembers = (whichBin == i);
    binMembers = ysort(flagBinMembers);
    binMean(i) = mean(binMembers);
end

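I tested this on a few small sample data sets and it seems to do what it's supposed to. Hopefully it will work for your data set too, whatever it contains! =)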
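To make the effect of maxDelta concrete, here is a small worked example on toy data. The values of x and y and the starting nBins of 2 are assumptions chosen only so the result is easy to check by hand; the steps are the same as in the code above.

```matlab
% Toy data (assumption): eight closely spaced points and two stragglers.
x = [1 2 3 4 5 6 7 8 30 60];
y = 2 * x;

maxDelta = 10;                 % gap that forces a new bin edge
nBins = 2;                     % requested number of equal-count bins
nValues = length(x);
[xsort, index] = sort(x);      % sort x in ascending order
ysort = y(index);              % sort y the same way as x

% Regular edges at sorted positions 1 and 6, then extra edges wherever
% a point sits at least maxDelta above its smaller neighbor (30 and 60).
edgeIndex = false(1, nValues);
edgeIndex(1:ceil(nValues/nBins):nValues) = true;
edgeIndex = edgeIndex | ([0 diff(xsort)] >= maxDelta);
nBins = sum(edgeIndex);        % the bin count grows from 2 to 4
binEdges = [xsort(edgeIndex) xsort(nValues)+1];

[counts, whichBin] = histc(xsort, binEdges);

binMean = zeros(1, nBins);
for i = 1:nBins
    binMean(i) = mean(ysort(whichBin == i));   % average y in bin i
end
```

Here binEdges comes out as [1 6 30 60 61], so the points at 30 and 60 each occupy a bin alone while the dense points share the first two bins.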

score 1 · Accepted Answer

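I have never used matlab, but from looking at your previous question I suspect you're looking for something along the lines of a Kdtree or a variation.

Clarification: Since there seems to be some confusion about this, I think a pseudocode example is in order.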

// Some of this shamelessly borrowed from the wikipedia article
function kdtree(points, lower_bound, upper_bound) {
    // lower_bound and upper_bound are the boundaries of your bucket
    if(points is empty) {
        return nil
    }
    // It's a trivial exercise to control the minimum size of a partition as well
    else {
        // Sort the points list and choose the median element
        select median from points.x

        node.location = median;

        node.left = kdtree(select from points where lower_bound < points.x <= median, lower_bound, median);
        node.right = kdtree(select from points where median < points.x <= upper_bound, median, upper_bound);

        return node
    }
}

kdtree(points, -inf, inf)

// or alternatively

kdtree(points, min(points.x), max(points.x))
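For readers working in MATLAB, the median-splitting idea in the pseudocode might be sketched as follows. This is not a full k-d tree: it splits along x only, stops at a fixed depth, and works on index ranges of the sorted data instead of building nodes. The synthetic x and the names depth, minCount, and segments are illustrative assumptions.

```matlab
% Synthetic data (assumption): dense near 0, sparse toward +/-10.
x = 10 * linspace(-1, 1, 1024).^3;

depth = 3;        % number of median splits, giving up to 2^depth buckets
minCount = 4;     % smallest bucket size a split may produce

xsort = sort(x);
segments = [1, numel(xsort)];   % each row: first and last index of a bucket

for level = 1:depth
    next = zeros(0, 2);
    for k = 1:size(segments, 1)
        lo = segments(k, 1);
        hi = segments(k, 2);
        n = hi - lo + 1;
        if n < 2 * minCount
            next(end+1, :) = [lo, hi];        % too small to split again
        else
            mid = lo + floor(n / 2) - 1;      % last index of the lower half
            next(end+1, :) = [lo, mid];       % points up to the median
            next(end+1, :) = [mid + 1, hi];   % points above the median
        end
    end
    segments = next;
end

% Lower x value of every bucket, plus a closing edge above the maximum.
binEdges = [xsort(segments(:, 1).'), xsort(end) + 1];
```

On one-dimensional data, repeated median splitting amounts to equal-count binning with a power-of-two number of bins, so the resulting binEdges can be passed to HISTC in the same way as the edges in the other answer.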
