algorithm - MATLAB: optimal number of points for a linear fit

Question

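I would like to make a linear fit to a few data points, as shown in the image. Since I know the intercept (in this case, say, 0.05), I want to fit only the points that lie in the linear region, using this particular intercept. In this case that would be, let's say, points 5:22 (but not 22:30). I'm looking for a simple algorithm to determine this optimal number of points, based on... hmm, that's the question... R^2? Any ideas how to do it? I was thinking about probing R^2 for fits using points 1 to 2:30, 2 to 3:30, and so on, but I don't really know how to wrap that into a clear and simple function. For fits with a fixed intercept I'm using polyfit0 (http://www.mathworks.com/matlabcentral/fileexchange/272-polyfit0-m). Thanks for any suggestions!

EDIT: sample data: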

intercept = 0.043;
x = 0.01:0.01:0.3;
y = [0.0530642513911393,0.0600786706929529,0.0673485248329648,0.0794662409166333,0.0895915873196170,0.103837395346484,0.107224784565365,0.120300492775786,0.126318699218730,0.141508831492330,0.147135757370947,0.161734674733680,0.170982455701681,0.191799936622712,0.192312642057298,0.204771365716483,0.222689541632988,0.242582251060963,0.252582727297656,0.267390860166283,0.282890010610515,0.292381165948577,0.307990544720676,0.314264952297699,0.332344368808024,0.355781519885611,0.373277721489254,0.387722683944356,0.413648156978284,0.446500064130389;];

(image: linear fit)

score 4 · Accepted Answer

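What you have here is a rather difficult problem, and it is hard to find a general solution for it.

One approach is to compute the slope and intercept of the line through every pair of consecutive points, and then run a cluster analysis on the intercepts: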

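slopes     = diff(y)./diff(x);
intercepts = y(1:end-1) - slopes.*x(1:end-1);

% kmeans expects one observation per row, hence the (:)
[idx, centers] = kmeans(intercepts(:), 3);

% cluster labels are arbitrary, so pick the cluster whose centre is
% closest to the known intercept (the variable from the question)
[~, k] = min(abs(centers - intercept));
x([idx; 0] == k)  % the points with the intercepts closest to the linear one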

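This requires the Statistics Toolbox (for kmeans). It is the best of all the methods I tried, although the range of points found this way may have a few small holes in it; for example, when the slope between two points in the start or end range happens to lie close to the slope of the line, those points will be detected as belonging to the line. This (and other factors) means the solution found this way needs a bit more post-processing.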
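One possible form of that post-processing, as a minimal sketch: close small holes in the selection and then keep only the longest contiguous run. The mask below is hypothetical, and `maxGap` is an assumed tuning parameter, not something taken from the answer.

```matlab
% Hypothetical selection mask (e.g. from the clustering step):
% a hole at index 6 and a stray hit at index 14.
mask = logical([0 0 1 1 1 0 1 1 1 1 0 0 0 1 0]);

% Close holes of at most maxGap points (assumed tuning parameter).
maxGap = 1;
d      = diff([0 mask 0]);
starts = find(d == 1);         % first index of each run of selected points
stops  = find(d == -1) - 1;    % last index of each run of selected points
for k = 1:numel(starts)-1
    if starts(k+1) - stops(k) - 1 <= maxGap
        mask(stops(k)+1 : starts(k+1)-1) = true;
    end
end

% Keep only the longest contiguous run.
d      = diff([0 mask 0]);
starts = find(d == 1);
stops  = find(d == -1) - 1;
[~, longest] = max(stops - starts + 1);
range = starts(longest):stops(longest);
```

For this mask the hole at index 6 is filled and the isolated hit at index 14 is dropped, so `range` covers indices 3 to 10.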

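Another approach (which I failed to build successfully) is to do the linear fit in a loop, each time growing the range of points from some point in the middle towards both endpoints, and checking whether the sum of squared errors is still small. I gave up on this quickly, because defining what "small" means is very subjective and has to be done in some heuristic way.

I then tried a more systematic and robust variant of the above: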
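For illustration only, the grow-from-the-middle loop could look roughly like the sketch below. The data are synthetic and noise-free, and the threshold `tol` is an arbitrary assumption; picking it is exactly the subjective part described above.

```matlab
% Synthetic, noise-free data (assumed): linear in the middle, curved at both ends.
intercept = 1.5;
x = linspace(0.1, 5, 100).';
y = 2*x + intercept;
y(1:12)   = log(x(1:12)) + y(12) - log(x(12));
y(74:100) = y(74:100) + (x(74:100) - x(74)).^8;

tol = 1e-3;   % assumed threshold on the mean squared residual

% Mean squared residual of a fixed-intercept fit over the index range r.
mse = @(r) mean(((x(r)\(y(r) - intercept))*x(r) - (y(r) - intercept)).^2);

lo = 50;  hi = 51;    % start from the middle of the data
grew = true;
while grew
    grew = false;
    if lo > 1 && mse(lo-1:hi) < tol          % try one more point on the left
        lo = lo - 1;  grew = true;
    end
    if hi < numel(x) && mse(lo:hi+1) < tol   % try one more point on the right
        hi = hi + 1;  grew = true;
    end
end
```

Because the residual is averaged over the whole window, a few slightly curved points are diluted by the many linear ones, so `lo:hi` tends to overshoot the true linear region somewhat. That is one more reason the threshold is hard to choose.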

function test

    %% example data
    slope = 2;
    intercept = 1.5;

    x = linspace(0.1, 5, 100).';

    y         = slope*x + intercept;
    y(1:12)   = log(x(1:12)) + y(12)-log(x(12));
    y(74:100) = y(74:100) + (x(74:100)-x(74)).^8;

    y = y + 0.2*randn(size(y));


    %% simple algorithm

    [X,fn] = fminsearch(@(ii)P(ii, x,y,intercept), [0.5 0.5])

    [~,inds] = P(X, x,y,intercept)   % same argument order as in the fminsearch call

end

function [C, inds] = P(ii, x,y,intercept)
% ii represents fraction of range from center to end,
% So ii lies between 0 and 1. 

    N = numel(x);
    n = round(N/2);  

    ii = round(ii*n);

    inds = min(max(1, n+(-ii(1):ii(2))), N);

    % Solve linear system with fixed intercept
    A = x(inds);
    b = y(inds) - intercept;

    % and return the sum of squared errors, divided by 
    % the number of points included in the set. This 
    % last step is required to prevent fminsearch from
    % reducing the set to 1 point (= minimum possible 
    % squared error). 
    C = sum(((A\b)*A - b).^2)/numel(inds);    

end

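It only finds a rough approximation of the desired indices (12 and 74 in this example).

When fminsearch is run a few dozen times with random starting values (really just rand(1,2)), it becomes more reliable, but I still wouldn't bet my life on it.

If you have the Statistics Toolbox, use the kmeans option.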
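A minimal sketch of such a multi-start run is shown here. It rebuilds the cost of P with anonymous functions so that the snippet stands alone; the helper names (`indsOf`, `res`, `cost`) and the number of restarts `nStarts` are assumptions made for illustration.

```matlab
% Example data, built the same way as in the test function above
slope = 2;  intercept = 1.5;
x = linspace(0.1, 5, 100).';
y = slope*x + intercept;
y(1:12)   = log(x(1:12)) + y(12) - log(x(12));
y(74:100) = y(74:100) + (x(74:100) - x(74)).^8;
y = y + 0.2*randn(size(y));

% Same cost as P, written with anonymous functions
N = numel(x);  n = round(N/2);
indsOf = @(ii) min(max(1, n + (-round(ii(1)*n):round(ii(2)*n))), N);
res    = @(r) (x(r)\(y(r) - intercept))*x(r) - (y(r) - intercept);
cost   = @(ii) sum(res(indsOf(ii)).^2) / numel(indsOf(ii));

% Multi-start: keep the best result of nStarts runs (assumed setting)
nStarts = 30;
bestC = Inf;  bestX = [NaN NaN];
for k = 1:nStarts
    [X, C] = fminsearch(cost, rand(1,2));
    if C < bestC
        bestC = C;  bestX = X;
    end
end
inds = indsOf(bestX);   % index range selected by the best run
```

Since the data contain random noise and the starting points are random, the selected range changes from run to run; `inds([1 end])` is only an estimate of the edges of the linear region.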

score 1 · Accepted Answer

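Depending on the number of data values, I would split the data into a relatively small number of overlapping segments, and for each segment compute the linear fit, or rather the first-order coefficient (remember that you know the intercept, which will be the same for all segments).

Then, for each coefficient, compute the MSE between this hypothetical line and the entire data set, and select the coefficient that yields the smallest MSE.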
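The two steps above might be sketched as follows. The data are hypothetical (a line through the known intercept that bends upward at the end), and the segment length `len` and shift `step` are assumed values, not something the answer specifies.

```matlab
% Hypothetical data: straight line through the known intercept, bending upward at the end
intercept = 0.043;
x = 0.01:0.01:0.3;
y = 1.2*x + intercept;
y(23:30) = y(23:30) + 40*(x(23:30) - x(22)).^2;

len  = 8;   % segment length (assumed)
step = 4;   % shift between segment starts; step < len gives overlap (assumed)
starts = 1:step:(numel(x) - len + 1);

slopes = zeros(size(starts));
mseAll = zeros(size(starts));
for k = 1:numel(starts)
    s = starts(k):(starts(k) + len - 1);
    % least-squares slope of a line forced through the known intercept
    slopes(k) = sum(x(s).*(y(s) - intercept)) / sum(x(s).^2);
    % error of this candidate line against the entire data set
    mseAll(k) = mean((slopes(k)*x + intercept - y).^2);
end

[bestMSE, kBest] = min(mseAll);
bestSlope   = slopes(kBest);
bestSegment = starts(kBest):(starts(kBest) + len - 1);
```

Note that the error is measured against all points, including the curved ones. With a fixed intercept the MSE is a quadratic function of the slope, so the winning candidate is the one whose slope lies closest to the all-points fit; it is worth checking the result against the plot.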
