0

我有一个数据集,包括一个长的 x 值数组和一个同样长的 y 值数组。对于每个 (x,y) 对,我想在已知函数 y(x) 上找到最近的点。

我原则上可以遍历每一对并执行最小化,例如 scipy.optimize.cobyla,但在 python 中循环很慢。Scipy 的 odr 包看起来很有趣,但我不知道如何让它简单地返回正交向量而不最小化整个事物(将最大迭代“maxit”设置为零并不能满足我的需求)。

有没有一种简单的方法可以使用 numpy 数组的速度来完成这项工作?

4

2 回答 2

1

答案很简单:

  1. 不要循环列表中的点
  2. 循环遍历函数曲线上的点。

我冒昧地将您的函数 y(x) 重命名为 f(z) 以避免混淆。

import numpy as np

# x and y are your numpy arrays of point coords
x = np.array([1,2])
y = np.array([3,4])
# this is your "y(x)" function
def f(z):
    return z**2

xmin = x.min()
xmax = x.max()
step = 0.01 # choose your step at the precision you want

# find distances to every point
zpoints = np.arange(xmin,xmax,step)
distances_squared = np.array([(y-f(z))**2+(x-z)**2 for z in zpoints])

# find z coords of closest points
zmin = zpoints[distances_squared.argmin(axis=0)]
fmin = np.array([f(z) for z in zmin])

for i in range(len(x)):
    print("point on the curve {},{} is closest to {},{}".format(zmin[i],fmin[i],x[i],y[i]))

曲线上的点 1.6700000000000006,2.788900000000002 最接近 1,3

曲线上的点 1.9900000000000009,3.9601000000000033 最接近 2,4

于 2017-10-19T08:01:42.153 回答
1

有一种方法可以加快 Hennadii Madan 的方法,即要求 numpy 执行循环而不是 python。像往常一样,这是以额外的 RAM 为代价的。

下面是我现在用于 2d 的函数。一个很好的特性是它是对称的——可以交换数据集并且计算时间是相同的。

def find_nearests_2d(x1, y1, x2, y2):
   """
   Given two data sets d1 = (x1, y1) and d2 = (x2, y2), return the x,y pairs
   from d2 that are closest to each pair from x1, the difference vectors, and
   the d2 indices of these closest points. 

   Parameters
   ----------
   x1
       1D array of x-values for data set 1.
   y1  
       1D array of y-values for data set 1 (must match size of x1).
   x2
       1D array of x-values for data set 2.
   y2
       1D array of y-values for data set 2 (must match size of x2).

   Returns x2mins, y2mins, xdiffs, ydiffs, indices
   -------
   x2mins
       1D array of minimum-distance x-values from data set 2. One value for each x1.
   y2mins
       1D array of minimum-distance y-values from data set 2. One value for each y1.
   xdiffs 
       1D array of differences in x. One value for each x1.
   ydiffs
       1D array of differences in y. One value for each y1.
   indices
       Indices of each minimum-distance point in data set 2. One for each point in
       data set 1.
   """

   # Generate every combination of points for subtracting
   x1s, x2s = _n.meshgrid(x1, x2)
   y1s, y2s = _n.meshgrid(y1, y2)

   # Calculate all the differences
   dx = x1s - x2s
   dy = y1s - y2s
   d2 = dx**2 + dy**2

   # Find the index of the minimum for each data point
   n = _n.argmin(d2, 0)

   # Index for extracting from the meshgrids
   m = range(len(n))

   return x2s[n,m], y2s[n,m], dx[n,m], dy[n,m], d2[n,m], n

然后还可以使用它来快速估计 x,y 对与函数之间的距离:

def find_nearests_function(x, y, f, *args, fpoints=1000):
    """
    Takes a data set (arrays of x and y values), and a function f(x, *args),
    then estimates the points on the curve f(x) that are closest to each of 
    the data set's x,y pairs.

    Parameters
    ----------
    x
        1D array of x-values for data set 1.
    y  
        1D array of y-values for data set 1 (must match size of x).
    f
        A function of the form f(x, *args) with optional additional arguments.
    *args
        Optional additional arguments to send to f (after argument x).
    fpoints=1000
        Number of evenly-spaced points to search in the x-domain (automatically
        the maximum possible range).

    """

    # Make sure everything is a numpy array
    x = _n.array(x)
    y = _n.array(y)

    # First figure out the range we need for f. Since the function is single-
    # valued, we can put bounds on the x-range: for each point, calculate the 
    # y-distance, and subtract / add this to the x-values
    dys  = _n.abs(f(x)-y)
    xmin = min(x-dys)
    xmax = max(x+dys)

    # Get "dense" function arrays
    xf = _n.linspace(xmin, xmax, fpoints)
    yf = f(xf,*args)

    # Find all the minima
    xfs, yfs, dxs, dys, d2s, n = find_nearests_2d(x, y, xf, yf)

    # Return this info plus the function arrays used
    return xfs, yfs, dxs, dys, d2s, n, xf, yf

如果这是正交距离回归的一部分(如在我的情况下),则差异 dx 和 dy 可以很容易地通过误差条数据集进行缩放,而无需太多开销,因此返回的距离是学生化(无单位)残差。

最终,这种“均匀搜索各处”技术只会让你接近,如果函数在 x 数据范围内不是特别平滑,则会失败。

快速测试代码:

x  = [1,2,5]
y  = [1,-1,1]

def f(x): return _n.cos(x)

fxmin, fymin, dxmin, dymin, d2min, n, xf, yf = find_nearests_function(x, y, f)

import pylab
pylab.plot(x,y, marker='o', ls='', color='m', label='input points')
pylab.plot(xf,yf, color='b', label='function')
pylab.plot(fxmin,fymin, marker='o', ls='', color='r', label='nearest points')
pylab.legend()
pylab.show()

生产

在此处输入图像描述

于 2017-10-24T18:18:06.793 回答