python - Unexpected standard errors with weighted least squares in Python Pandas

Question

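In the code for the main OLS class in Python Pandas, I am looking for help clarifying which conventions are used for the standard errors and t-stats reported when a weighted OLS is performed.

Here's my example data set, with some imports to use Pandas and to use scikits.statsmodels WLS directly: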

import pandas
import numpy as np
from statsmodels.regression.linear_model import WLS

# Make some random data.
np.random.seed(42)
df = pandas.DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'weights'])

# Add an intercept term for direct use in WLS
df['intercept'] = 1 

# Add a number (I picked 10) to stabilize the weight proportions a little.
df['weights'] = df.weights + 10

# Fit the regression models.
pd_wls = pandas.ols(y=df.a, x=df.b, weights=df.weights)
sm_wls = WLS(df.a, df[['intercept','b']], weights=df.weights).fit()

I run this in IPython using %cpaste and then print the summaries of both regressions:

In [226]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:import pandas
:import numpy as np
:from statsmodels.regression.linear_model import WLS
:
:# Make some random data.
:np.random.seed(42)
:df = pandas.DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'weights'])
:
:# Add an intercept term for direct use in WLS
:df['intercept'] = 1
:
:# Add a number (I picked 10) to stabilize the weight proportions a little.
:df['weights'] = df.weights + 10
:
:# Fit the regression models.
:pd_wls = pandas.ols(y=df.a, x=df.b, weights=df.weights)
:sm_wls = WLS(df.a, df[['intercept','b']], weights=df.weights).fit()
:--

In [227]: pd_wls
Out[227]:

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         10
Number of Degrees of Freedom:   2

R-squared:         0.2685
Adj R-squared:     0.1770

Rmse:              2.4125

F-stat (1, 8):     2.9361, p-value:     0.1250

Degrees of Freedom: model 1, resid 8

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x     0.5768     1.0191       0.57     0.5869    -1.4206     2.5742
     intercept     0.5227     0.9079       0.58     0.5806    -1.2567     2.3021
---------------------------------End of Summary---------------------------------


In [228]: sm_wls.summ
sm_wls.summary      sm_wls.summary_old

In [228]: sm_wls.summary()
Out[228]:
<class 'statsmodels.iolib.summary.Summary'>
"""
                            WLS Regression Results
==============================================================================
Dep. Variable:                      a   R-squared:                       0.268
Model:                            WLS   Adj. R-squared:                  0.177
Method:                 Least Squares   F-statistic:                     2.936
Date:                Wed, 17 Jul 2013   Prob (F-statistic):              0.125
Time:                        15:14:04   Log-Likelihood:                -10.560
No. Observations:                  10   AIC:                             25.12
Df Residuals:                       8   BIC:                             25.72
Df Model:                           1
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
intercept      0.5227      0.295      1.770      0.115        -0.158     1.204
b              0.5768      0.333      1.730      0.122        -0.192     1.346
==============================================================================
Omnibus:                        0.967   Durbin-Watson:                   1.082
Prob(Omnibus):                  0.617   Jarque-Bera (JB):                0.622
Skew:                           0.003   Prob(JB):                        0.733
Kurtosis:                       1.778   Cond. No.                         1.90
==============================================================================
"""

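Note the mismatching standard errors: Pandas claims the standard errors are [0.9079, 1.0191] while statsmodels says [0.295, 0.333].

Back in the code I linked at the top of the post, I tried to track down where the mismatch comes from.

First, you can see that the standard errors are reported by this function: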
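As a quick sanity check, the t-stats in both summaries are just the coefficient divided by its standard error, so the mismatch in standard errors carries straight through to the t-stats. The sketch below only uses the rounded numbers printed in the two summaries; the variable names are made up for illustration.

```python
# Values copied from the two summaries above, in (intercept, b) order.
coef = [0.5227, 0.5768]
se_pandas = [0.9079, 1.0191]
se_statsmodels = [0.295, 0.333]

# t-stat = coefficient / standard error
t_pandas = [round(c / s, 2) for c, s in zip(coef, se_pandas)]
t_statsmodels = [round(c / s, 2) for c, s in zip(coef, se_statsmodels)]

print(t_pandas)       # compare with the pandas t-stat column
print(t_statsmodels)  # compare with the statsmodels t column
```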

def _std_err_raw(self):
    """Returns the raw standard err values."""
    return np.sqrt(np.diag(self._var_beta_raw))

So looking at self._var_beta_raw I find:

def _var_beta_raw(self):
    """
    Returns the raw covariance of beta.
    """
    x = self._x.values
    y = self._y.values

    xx = np.dot(x.T, x)

    if self._nw_lags is None:
        return math.inv(xx) * (self._rmse_raw ** 2)
    else:
        resid = y - np.dot(x, self._beta_raw)
        m = (x.T * resid).T

        xeps = math.newey_west(m, self._nw_lags, self._nobs, self._df_raw,
                               self._nw_overlap)

        xx_inv = math.inv(xx)
        return np.dot(xx_inv, np.dot(xeps, xx_inv))

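In my use case, self._nw_lags will always be None, so it's the first branch that is puzzling. Since xx is just the standard product of the regressor matrix, x.T.dot(x), I'm wondering how the weights enter into this. The term self._rmse_raw comes directly from the statsmodels regression fitted in the constructor of OLS, so it most definitely incorporates the weights.

This prompts these questions:

Why is the standard error reported with the weights applied in the RMSE part, but not on the regressor variables?
Is this standard practice if you want the "untransformed" variables (wouldn't you then also want the untransformed RMSE?) Is there a way to have Pandas give back the fully weighted version of the standard error?
Why all the misdirection? In the constructor, the full statsmodels fitted regression is computed. Why wouldn't absolutely every summary statistic come straight from there? Why is it mixed and matched so that some come from the statsmodels output and some come from Pandas home-cooked calculations?

It looks like I can reconcile the Pandas output by doing the following: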
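To make the two conventions concrete, here is a minimal NumPy-only sketch, not the pandas or statsmodels source. It assumes the weighted RMSE is the square root of the weighted residual sum of squares divided by the residual degrees of freedom, and then combines that RMSE once with the unweighted x.T.dot(x) and once with the weighted product. The names se_mixed and se_weighted are made up for illustration.

```python
import numpy as np

# Rebuild the toy data from the question (same legacy seeding as the post).
np.random.seed(42)
data = np.random.randn(10, 3)
a, b, w = data[:, 0], data[:, 1], data[:, 2] + 10

X = np.column_stack([np.ones_like(b), b])  # columns: intercept, b

# WLS coefficients from the weighted normal equations (X'WX) beta = X'W a.
xwx = X.T @ (X * w[:, None])
beta = np.linalg.solve(xwx, X.T @ (w * a))

# Weighted RMSE (assumed definition): weighted SSR over residual df.
resid = a - X @ beta
rmse = np.sqrt(np.sum(w * resid ** 2) / (len(a) - X.shape[1]))

# Mixed convention: weighted RMSE combined with the unweighted X'X.
se_mixed = np.sqrt(np.diag(np.linalg.inv(X.T @ X)) * rmse ** 2)

# Fully weighted convention: the same RMSE combined with X'WX.
se_weighted = np.sqrt(np.diag(np.linalg.inv(xwx)) * rmse ** 2)

print(beta, rmse)
print(se_mixed)     # compare with the Std Err column in the pandas summary
print(se_weighted)  # compare with the std err column in the statsmodels summary
```

On this data set the mixed version should land on the pandas numbers and the fully weighted version on the statsmodels numbers, which is consistent with the reading that only the RMSE term carries the weights in _var_beta_raw.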

In [238]: xs = df[['intercept', 'b']]

In [239]: trans_xs = xs.values * np.sqrt(df.weights.values[:,None])

In [240]: trans_xs
Out[240]:
array([[ 3.26307961, -0.45116742],
       [ 3.12503809, -0.73173821],
       [ 3.08715494,  2.36918991],
       [ 3.08776136, -1.43092325],
       [ 2.87664425, -5.50382662],
       [ 3.21158019, -3.25278836],
       [ 3.38609639, -4.78219647],
       [ 2.92835309,  0.19774643],
       [ 2.97472796,  0.32996453],
       [ 3.1158155 , -1.87147934]])

In [241]: np.sqrt(np.diag(np.linalg.inv(trans_xs.T.dot(trans_xs)) * (pd_wls._rmse_raw ** 2)))
Out[241]: array([ 0.29525952,  0.33344823])

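I'm just very confused by this relationship. Is this something that is common among statisticians: involving the weights in the RMSE part, but then choosing whether or not to weight the variables when calculating the standard error of the coefficients? If that's the case, why wouldn't the coefficients themselves also differ between Pandas and statsmodels, since those are similarly derived from variables first transformed by statsmodels?

For reference, here is the full data set used in my toy example (in case np.random.seed isn't sufficient to make it reproducible):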
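One possible way to look at the coefficient part, sketched with NumPy only: WLS coefficients are what you get from an ordinary least-squares fit after scaling every row of the regressors and the response by the square root of its weight. If both libraries take the coefficients from that same transformed fit (an assumption based on the constructor behaviour described above, not something verified against the source here), the coefficients agree no matter which matrix is used later in the standard-error step.

```python
import numpy as np

# Rebuild the toy data from the question (same legacy seeding as the post).
np.random.seed(42)
data = np.random.randn(10, 3)
a, b, w = data[:, 0], data[:, 1], data[:, 2] + 10
X = np.column_stack([np.ones_like(b), b])  # columns: intercept, b

# Scale each row by sqrt(weight); this mirrors trans_xs in the session above.
sqrt_w = np.sqrt(w)
X_trans = X * sqrt_w[:, None]
a_trans = a * sqrt_w

# Ordinary least squares on the transformed data gives the WLS coefficients.
beta_wls, *_ = np.linalg.lstsq(X_trans, a_trans, rcond=None)

print(X_trans[0])  # first row of the transformed regressors
print(beta_wls)    # compare with the coef columns in both summaries
```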

In [242]: df
Out[242]:
          a         b    weights  intercept
0  0.496714 -0.138264  10.647689          1
1  1.523030 -0.234153   9.765863          1
2  1.579213  0.767435   9.530526          1
3  0.542560 -0.463418   9.534270          1
4  0.241962 -1.913280   8.275082          1
5 -0.562288 -1.012831  10.314247          1
6 -0.908024 -1.412304  11.465649          1
7 -0.225776  0.067528   8.575252          1
8 -0.544383  0.110923   8.849006          1
9  0.375698 -0.600639   9.708306          1

score 5 · Accepted Answer

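Not directly answering your question here, but in general you should prefer the statsmodels code over pandas for modeling. Some problems with WLS were recently discovered in statsmodels and are now fixed. AFAIK, they were also fixed in pandas, but for the most part the pandas modeling code is not maintained, and the medium-term goal is to make sure everything available in pandas is deprecated and has been moved to statsmodels (the next statsmodels release, 0.6.0, should do it).

To be a little clearer, pandas is now a dependency of statsmodels. You can pass DataFrames to statsmodels or use formulas in statsmodels. This is the intended relationship going forward.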
