python - Scipy - 优化。查找两个变量之间的比率

Question

我有 3 个变量；Market_Price、小时、年龄。

使用优化我发现了每个变量和 Market_Price 之间的关系。

数据：

hours =  [1000,  10000,  11000,  11000,  15000,  18000,  37000,  24000,  28000,  28000,  42000,  46000,  50000,  34000,  34000,  46000,  50000,  56000,  64000,  64000,  65000,  80000,  81000,  81000,  44000,  49000,  76000,  76000,  89000,  38000,  80000,  69000,  46000,  47000,  57000,  72000,  77000,  68000]

market_Price =  [30945,  28974,  27989,  27989,  36008,  24780,  22980,  23997,  25957,  27847,  36000,  25588,  23980,  25990,  25990,  28995,  26770,  26488,  24988,  24988,  17574,  12995,  19788,  20488,  19980,  24978,  16000,  16400,  18988,  19980,  18488,  16988,  15000,  15000,  16998,  17499,  15780,  8400]

age =  [2,  2,  2,  2,  2,  2,  2,  3,  3,  3,  3,  3,  3,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  5,  5,  5,  5,  5,  6,  6,  7,  8,  8,  8,  8,  8,  13,]

我得出的关系是：

到 market_price 的小时数 = log(h)*h1+h2,

年龄到市场价格 = log(a)*a1+a2

其中 h1、h2、a1、a2 是使用 Scipy 的优化曲线拟合找到的。

现在我想将所有 3 项合并到一个计算中，根据年龄和时间，我可以确定 market_price。

到目前为止，我一直这样做的方法是通过确定哪个组合具有最小的标准偏差来找到两者之间的比率。

std_divs = []
for ratio in ratios:    
    n = 0
    price_difference_final = []
    while n < len(prices):
        predicted_price = (log(h)*h1+h1)*ratio + (log(a)*a1+a1)*(1-ratio)
        price_difference_final.append(prices[n] - predicted_price)
        n += 1
    data = np.array(price_difference_final)
    std_divs.append(np.std(data))
std_div = min(std_divs)
optimum_ratio = ratios[std_divs.index(min(std_divs))]

如您所见，我通过蛮力来完成此操作，这不是一个优雅的解决方案。

此外，现在我发现3之间的关系不能用单一的比率来表示，而是需要滑动比率。随着年份的增加，小时/年龄比率会降低，从而使年龄在市场价格方面的权重越来越大。

不幸的是，我无法使用 Scipy 的曲线拟合来实现这一点，因为它只接受一对数组。

有没有想过如何最好地实现这一目标？

score 2 · Accepted Answer

这是多元回归问题，您不需要编写自己的代码，因为它已经存在：

http://wiki.scipy.org/Cookbook/OLS

注意：最后你没有 5 个参数，h1, h2, a1, a2, ratio. 你只有三个：h2*ratio+a2*(1-ratio) h1*ratio a1*(1-ratio)

In [26]:

y=np.array(market_Price)
x=np.log(np.array([hours, age])).T
In [27]:

mymodel=ols(y, x, 'Market_Price', ['Hours', 'Age'])
In [28]:

mymodel.p # return coefficient p-values
Out[28]:
array([  1.32065700e-05,   3.06318351e-01,   1.34081122e-05])
In [29]:

mymodel.summary()

==============================================================================
Dependent Variable: Market_Price
Method: Least Squares
Date:  Mon, 24 Mar 2014
Time:  15:40:00
# obs:                  38
# variables:         3
==============================================================================
variable     coefficient     std. Error      t-statistic     prob.
==============================================================================
const           45838.261850      9051.125823      5.064371      0.000013
Hours          -1023.097422      985.498239     -1.038152      0.306318
Age            -8862.186475      1751.640834     -5.059363      0.000013
==============================================================================
Models stats                         Residual stats
==============================================================================
R-squared             0.624227         Durbin-Watson stat   1.301026
Adjusted R-squared    0.602754         Omnibus stat         2.999547
F-statistic           29.070664         Prob(Omnibus stat)   0.223181
Prob (F-statistic)    0.000000          JB stat              1.807013
Log likelihood       -366.421766            Prob(JB)             0.405146
AIC criterion         19.443251         Skew                 0.376021
BIC criterion         19.572534         Kurtosis             3.758751
==============================================================================

score 2 · Accepted Answer

可以创建一个多维数组，在这种情况下，您可以将您的hours和age数据都传递到curve_fit. 这样的例子可能是：

import numpy as np
from scipy.optimize import curve_fit

hours =  [1000,  10000,  11000,  11000,  15000,  18000,  37000,  24000,
          28000,  28000,  42000,  46000,  50000,  34000,  34000,  46000,
          50000,  56000,  64000,  64000,  65000,  80000,  81000,  81000,
          44000,  49000,  76000,  76000,  89000,  38000,  80000,  69000,
          46000,  47000,  57000,  72000,  77000,  68000]

market_Price =  [30945,  28974,  27989,  27989,  36008,  24780,  22980,
                 23997,  25957,  27847,  36000,  25588,  23980,  25990,  
                 25990,  28995,  26770,  26488,  24988,  24988,  17574,
                 12995,  19788,  20488,  19980,  24978,  16000,  16400,
                 18988,  19980,  18488,  16988,  15000,  15000,  16998,
                 17499,  15780,  8400]

age =  [2,  2,  2,  2,  2,  2,  2,  3,  3,  3,  3,  3,  3,  4,  4,  4,
        4,  4,  4,  4,  4,  4,  4,  4,  5,  5,  5,  5,  5,  6,  6,  7,  
        8,  8,  8,  8,  8,  13]

combined = np.array([hours, market_Price])

def f():
    # Some function which uses combined where
    # combined[0] = hours and combined[1] = market_Price
    pass

popt, pcov = curve_fit(f, combined, market_Price)

python - Scipy - 优化。查找两个变量之间的比率

2 回答 2

Related

Reference