algorithm - 在曲线上找到最佳权衡点

Question

假设我有一些数据，我想为其拟合一个参数化模型。我的目标是找到这个模型参数的最佳值。

我正在使用AIC / BIC / MDL类型的标准进行模型选择，该标准奖励具有低错误的模型但也惩罚具有高复杂性的模型（可以说，我们正在为这些数据寻求最简单但最有说服力的解释，a la Occam's剃须刀）。

继上述之后，这是我根据三个不同标准（两个要最小化，一个要最大化）得到的那种东西的一个例子：

有氧运动

在视觉上，您可以轻松地看到肘部形状，并且您可以在该区域的某处为参数选择一个值。问题是我正在为大量实验这样做，我需要一种无需干预即可找到此值的方法。

我的第一个直觉是尝试从拐角处以 45 度角画一条线并继续移动它直到它与曲线相交，但这说起来容易做起来难 :) 如果曲线有些倾斜，它也可能会错过感兴趣的区域。

关于如何实现这一点或更好的想法的任何想法？

以下是重现上述图表之一所需的样本：

curve = [8.4663 8.3457 5.4507 5.3275 4.8305 4.7895 4.6889 4.6833 4.6819 4.6542 4.6501 4.6287 4.6162 4.585 4.5535 4.5134 4.474 4.4089 4.3797 4.3494 4.3268 4.3218 4.3206 4.3206 4.3203 4.2975 4.2864 4.2821 4.2544 4.2288 4.2281 4.2265 4.2226 4.2206 4.2146 4.2144 4.2114 4.1923 4.19 4.1894 4.1785 4.178 4.1694 4.1694 4.1694 4.1556 4.1498 4.1498 4.1357 4.1222 4.1222 4.1217 4.1192 4.1178 4.1139 4.1135 4.1125 4.1035 4.1025 4.1023 4.0971 4.0969 4.0915 4.0915 4.0914 4.0836 4.0804 4.0803 4.0722 4.065 4.065 4.0649 4.0644 4.0637 4.0616 4.0616 4.061 4.0572 4.0563 4.056 4.0545 4.0545 4.0522 4.0519 4.0514 4.0484 4.0467 4.0463 4.0422 4.0392 4.0388 4.0385 4.0385 4.0383 4.038 4.0379 4.0375 4.0364 4.0353 4.0344];
plot(1:100, curve)

编辑

我接受了Jonas给出的解决方案。基本上，对于p曲线上的每个点，我们找到最大距离d为：

点线距离

score 49 · Accepted Answer

找到肘部的一种快速方法是从曲线的第一个点到最后一个点画一条线，然后找到离该线最远的数据点。

这当然在一定程度上取决于您在直线的平坦部分拥有的点数，但是如果您每次测试相同数量的参数，它应该会相当不错。

curve = [8.4663 8.3457 5.4507 5.3275 4.8305 4.7895 4.6889 4.6833 4.6819 4.6542 4.6501 4.6287 4.6162 4.585 4.5535 4.5134 4.474 4.4089 4.3797 4.3494 4.3268 4.3218 4.3206 4.3206 4.3203 4.2975 4.2864 4.2821 4.2544 4.2288 4.2281 4.2265 4.2226 4.2206 4.2146 4.2144 4.2114 4.1923 4.19 4.1894 4.1785 4.178 4.1694 4.1694 4.1694 4.1556 4.1498 4.1498 4.1357 4.1222 4.1222 4.1217 4.1192 4.1178 4.1139 4.1135 4.1125 4.1035 4.1025 4.1023 4.0971 4.0969 4.0915 4.0915 4.0914 4.0836 4.0804 4.0803 4.0722 4.065 4.065 4.0649 4.0644 4.0637 4.0616 4.0616 4.061 4.0572 4.0563 4.056 4.0545 4.0545 4.0522 4.0519 4.0514 4.0484 4.0467 4.0463 4.0422 4.0392 4.0388 4.0385 4.0385 4.0383 4.038 4.0379 4.0375 4.0364 4.0353 4.0344];

%# get coordinates of all the points
nPoints = length(curve);
allCoord = [1:nPoints;curve]';              %'# SO formatting

%# pull out first point
firstPoint = allCoord(1,:);

%# get vector between first and last point - this is the line
lineVec = allCoord(end,:) - firstPoint;

%# normalize the line vector
lineVecN = lineVec / sqrt(sum(lineVec.^2));

%# find the distance from each point to the line:
%# vector between all points and first point
vecFromFirst = bsxfun(@minus, allCoord, firstPoint);

%# To calculate the distance to the line, we split vecFromFirst into two 
%# components, one that is parallel to the line and one that is perpendicular 
%# Then, we take the norm of the part that is perpendicular to the line and 
%# get the distance.
%# We find the vector parallel to the line by projecting vecFromFirst onto 
%# the line. The perpendicular vector is vecFromFirst - vecFromFirstParallel
%# We project vecFromFirst by taking the scalar product of the vector with 
%# the unit vector that points in the direction of the line (this gives us 
%# the length of the projection of vecFromFirst onto the line). If we 
%# multiply the scalar product by the unit vector, we have vecFromFirstParallel
scalarProduct = dot(vecFromFirst, repmat(lineVecN,nPoints,1), 2);
vecFromFirstParallel = scalarProduct * lineVecN;
vecToLine = vecFromFirst - vecFromFirstParallel;

%# distance to line is the norm of vecToLine
distToLine = sqrt(sum(vecToLine.^2,2));

%# plot the distance to the line
figure('Name','distance from curve to line'), plot(distToLine)

%# now all you need is to find the maximum
[maxDist,idxOfBestPoint] = max(distToLine);

%# plot
figure, plot(curve)
hold on
plot(allCoord(idxOfBestPoint,1), allCoord(idxOfBestPoint,2), 'or')

score 27 · Accepted Answer

如果有人需要上面Jonas发布的Matlab代码的工作Python版本。

import numpy as np
curve = [8.4663, 8.3457, 5.4507, 5.3275, 4.8305, 4.7895, 4.6889, 4.6833, 4.6819, 4.6542, 4.6501, 4.6287, 4.6162, 4.585, 4.5535, 4.5134, 4.474, 4.4089, 4.3797, 4.3494, 4.3268, 4.3218, 4.3206, 4.3206, 4.3203, 4.2975, 4.2864, 4.2821, 4.2544, 4.2288, 4.2281, 4.2265, 4.2226, 4.2206, 4.2146, 4.2144, 4.2114, 4.1923, 4.19, 4.1894, 4.1785, 4.178, 4.1694, 4.1694, 4.1694, 4.1556, 4.1498, 4.1498, 4.1357, 4.1222, 4.1222, 4.1217, 4.1192, 4.1178, 4.1139, 4.1135, 4.1125, 4.1035, 4.1025, 4.1023, 4.0971, 4.0969, 4.0915, 4.0915, 4.0914, 4.0836, 4.0804, 4.0803, 4.0722, 4.065, 4.065, 4.0649, 4.0644, 4.0637, 4.0616, 4.0616, 4.061, 4.0572, 4.0563, 4.056, 4.0545, 4.0545, 4.0522, 4.0519, 4.0514, 4.0484, 4.0467, 4.0463, 4.0422, 4.0392, 4.0388, 4.0385, 4.0385, 4.0383, 4.038, 4.0379, 4.0375, 4.0364, 4.0353, 4.0344]
nPoints = len(curve)
allCoord = np.vstack((range(nPoints), curve)).T
np.array([range(nPoints), curve])
firstPoint = allCoord[0]
lineVec = allCoord[-1] - allCoord[0]
lineVecNorm = lineVec / np.sqrt(np.sum(lineVec**2))
vecFromFirst = allCoord - firstPoint
scalarProduct = np.sum(vecFromFirst * np.matlib.repmat(lineVecNorm, nPoints, 1), axis=1)
vecFromFirstParallel = np.outer(scalarProduct, lineVecNorm)
vecToLine = vecFromFirst - vecFromFirstParallel
distToLine = np.sqrt(np.sum(vecToLine ** 2, axis=1))
idxOfBestPoint = np.argmax(distToLine)

score 10 · Accepted Answer

这是 Jonas 在 R 中实现的解决方案：

elbow_finder <- function(x_values, y_values) {
  # Max values to create line
  max_x_x <- max(x_values)
  max_x_y <- y_values[which.max(x_values)]
  max_y_y <- max(y_values)
  max_y_x <- x_values[which.max(y_values)]
  max_df <- data.frame(x = c(max_y_x, max_x_x), y = c(max_y_y, max_x_y))

  # Creating straight line between the max values
  fit <- lm(max_df$y ~ max_df$x)

  # Distance from point to line
  distances <- c()
  for(i in 1:length(x_values)) {
    distances <- c(distances, abs(coef(fit)[2]*x_values[i] - y_values[i] + coef(fit)[1]) / sqrt(coef(fit)[2]^2 + 1^2))
  }

  # Max distance point
  x_max_dist <- x_values[which.max(distances)]
  y_max_dist <- y_values[which.max(distances)]

  return(c(x_max_dist, y_max_dist))
}

score 8 · Accepted Answer

首先，快速回顾一下微积分：每个图的一阶导数表示被绘制f'的函数变化的速率。f二阶导数f''表示变化的速率f'。如果f''很小，则表示图形正在以适度的速度改变方向。但如果f''很大，则意味着图形正在快速改变方向。

您想要隔离f''图形域中最大的点。这些将是为您的最佳模型选择的候选点。你选择哪一点取决于你，因为你没有具体说明你在多大程度上看重适合性和复杂性。

score 8 · Accepted Answer

信息论模型选择的重点是它已经考虑了参数的数量。因此，不需要找到肘部，您只需找到最小值即可。

仅在使用 fit 时查找曲线的肘部才有意义。即使这样，您选择选择肘部的方法在某种意义上也是对参数数量的惩罚。要选择弯头，您需要最小化从原点到曲线的距离。距离计算中两个维度的相对权重会产生一个固有的惩罚项。信息论标准根据参数数量和用于估计模型的数据样本数量来设置此指标。

底线建议：使用 BIC 并取最小值。

score 5 · Accepted Answer

所以解决这个问题的一种方法是在你的肘部L上放两条线。但是由于曲线的一部分只有几个点（正如我在评论中提到的那样），线拟合会受到影响，除非您检测到哪些点被间隔开并在它们之间插值以制造更统一的系列，然后使用 RANSAC找到适合L的两条线- 有点复杂但并非不可能。

因此，这里有一个更简单的解决方案 - 由于 MATLAB 的缩放（显然），您制作的图表看起来像他们所做的那样。所以我所做的就是使用比例信息最小化从“原点”到您的点的距离。

请注意：原点估计可以显着改善，但我将把它留给你。

这是代码：

%% Order
curve = [8.4663 8.3457 5.4507 5.3275 4.8305 4.7895 4.6889 4.6833 4.6819 4.6542 4.6501 4.6287 4.6162 4.585 4.5535 4.5134 4.474 4.4089 4.3797 4.3494 4.3268 4.3218 4.3206 4.3206 4.3203 4.2975 4.2864 4.2821 4.2544 4.2288 4.2281 4.2265 4.2226 4.2206 4.2146 4.2144 4.2114 4.1923 4.19 4.1894 4.1785 4.178 4.1694 4.1694 4.1694 4.1556 4.1498 4.1498 4.1357 4.1222 4.1222 4.1217 4.1192 4.1178 4.1139 4.1135 4.1125 4.1035 4.1025 4.1023 4.0971 4.0969 4.0915 4.0915 4.0914 4.0836 4.0804 4.0803 4.0722 4.065 4.065 4.0649 4.0644 4.0637 4.0616 4.0616 4.061 4.0572 4.0563 4.056 4.0545 4.0545 4.0522 4.0519 4.0514 4.0484 4.0467 4.0463 4.0422 4.0392 4.0388 4.0385 4.0385 4.0383 4.038 4.0379 4.0375 4.0364 4.0353 4.0344];
x_axis = 1:numel(curve);
points = [x_axis ; curve ]'; %' - SO formatting

%% Get the scaling info
f = figure(1);
plot(points(:,1),points(:,2));
ticks = get(get(f,'CurrentAxes'),'YTickLabel');
ticks = str2num(ticks);
aspect = get(get(f,'CurrentAxes'),'DataAspectRatio');
aspect = [aspect(2) aspect(1)];    
close(f);   

%% Get the "origin"
O = [x_axis(1) ticks(1)];

%% Scale the data - now the scaled values look like MATLAB''s idea of
% what a good plot should look like
scaled_O = O.*aspect;
scaled_points = bsxfun(@times,points,aspect);

%% Find the closest point
del = sum((bsxfun(@minus,scaled_points,scaled_O).^2),2);
[val ind] = min(del);
best_ROC = [ind curve(ind)];

%% Display
plot(x_axis,curve,'.-');
hold on;
plot(O(1),O(2),'r*');
plot(best_ROC(1),best_ROC(2),'k*');

结果：

同样对于Fit(maximize)曲线，您必须将原点更改为[x_axis(1) ticks(end)].

score 3 · Accepted Answer

以一种简单直观的方式，我们可以说

如果我们从曲线上的任何一点到曲线的两个端点画两条线，这两条线以度为单位形成最小角度的点就是所需的点。

在这里，两条线可以被想象成手臂，点被想象成肘点！

score 2 · Accepted Answer

双重派生方法。但是，它似乎不适用于嘈杂的数据。对于输出，您只需找到 d2 的最大值即可识别肘部。此实现在 R 中。

elbow_finder <- function(x_values, y_values) {
  i_max <- length(x_values) - 1

  # First and second derived
  first_derived <- list()
  second_derived <- list()

  # First derived
  for(i in 2:i_max){
    slope1 <- (y_values[i+1] - y_values[i]) / (x_values[i+1] - x_values[i])
    slope2 <- (y_values[i] - y_values[i-1]) / (x_values[i] - x_values[i-1])
    slope_avg <- (slope1 + slope2) / 2
    first_derived[[i]] <- slope_avg 
  }
  first_derived[[1]] <- NA
  first_derived[[i_max+1]] <- NA
  first_derived <- unlist(first_derived)

  # Second derived
  for(i in 3:i_max-1){
    d1 <- (first_derived[i+1] - first_derived[i]) / (x_values[i+1] - x_values[i])
    d2 <- (first_derived[i] - first_derived[i-1]) / (x_values[i] - x_values[i-1])
    d_avg <- (d1 + d2) / 2
    second_derived[[i]] <- d_avg 
  }
  second_derived[[1]] <- NA
  second_derived[[2]] <- NA
  second_derived[[i_max]] <- NA
  second_derived[[i_max+1]] <- NA
  second_derived <- unlist(second_derived)

  return(list(d1 = first_derived, d2 = second_derived))
}

score 2 · Accepted Answer

我从事膝/肘点检测已经有一段时间了。无论如何，我是专家。一些可能与此问题相关的方法。

DFDT 代表动态一阶导数阈值。它计算一阶导数并使用阈值算法来检测膝/肘点。DSDT 类似但使用二阶导数，我的评估表明它们具有相似的性能。

S 方法是 L 方法的扩展。L 方法将两条直线拟合到曲线上，两条线之间的交点是膝/肘点。通过循环整体点、拟合线并评估 MSE（均方误差）来找到最佳拟合。S 方法拟合 3 条直线，这提高了准确性，但也需要更多的计算。

我所有的代码都在GitHub 上公开可用。此外，本文可以帮助您找到有关该主题的更多信息。它只有四页长，所以应该很容易阅读。您可以使用代码，如果您想讨论任何方法，请随意这样做。

score 1 · Accepted Answer

如果你愿意，我已经将它翻译成 R 作为我自己的练习（请原谅我未优化的编码风格）。*应用它来找到 k-means 上的最佳集群数 - 工作得很好。

elbow.point = function(x){
elbow.curve = c(x)
nPoints = length(elbow.curve);
allCoord = cbind(c(1:nPoints),c(elbow.curve))
# pull out first point
firstPoint = allCoord[1,]
# get vector between first and last point - this is the line
lineVec = allCoord[nPoints,] - firstPoint;
# normalize the line vector
lineVecN = lineVec / sqrt(sum(lineVec^2));
# find the distance from each point to the line:
# vector between all points and first point
vecFromFirst = lapply(c(1:nPoints), function(x){
  allCoord[x,] - firstPoint
})
vecFromFirst = do.call(rbind, vecFromFirst)
rep.row<-function(x,n){
  matrix(rep(x,each=n),nrow=n)
}
scalarProduct = matrix(nrow = nPoints, ncol = 2)
scalarProduct[,1] = vecFromFirst[,1] * rep.row(lineVecN,nPoints)[,1]
scalarProduct[,2] = vecFromFirst[,2] * rep.row(lineVecN,nPoints)[,2]
scalarProduct = as.matrix(rowSums(scalarProduct))
vecFromFirstParallel = matrix(nrow = nPoints, ncol = 2)
vecFromFirstParallel[,1] = scalarProduct * lineVecN[1]
vecFromFirstParallel[,2] = scalarProduct * lineVecN[2]
vecToLine = lapply(c(1:nPoints), function(x){
  vecFromFirst[x,] - vecFromFirstParallel[x,]
})
vecToLine = do.call(rbind, vecToLine)
# distance to line is the norm of vecToLine
distToLine = as.matrix(sqrt(rowSums(vecToLine^2)))
##
which.max(distToLine)
}

函数的输入 x 应该是带有您的值的列表/向量

score 0 · Accepted Answer

不要忽视模型选择的 k 折交叉验证，它是 AIC/BIC 的绝佳替代方案。还要考虑您正在建模的潜在情况，您可以使用领域知识来帮助选择模型。

algorithm - 在曲线上找到最佳权衡点

编辑

11 回答 11

Related

Reference