python - Python中面板数据中的多重共线性

Question

我习惯于使用 Stata 或 R 来做线性回归模型，但我正在将更多的工作流程转移到 Python。

这两个程序的有用之处在于它们直观地知道您并不关心线性模型中的所有实体或时间固定效应，因此在估计面板模型时，它们将从模型中删除多重共线假人（报告哪个他们掉落的那些）。

虽然我知道以这种方式估计模型并不理想，并且应该小心运行回归（等），但这在实践中很有用，因为这意味着您可以首先看到结果，并担心一些细微差别稍后的假人（特别是因为您不关心完全饱和的固定效应模型中的假人）。

让我举个例子。以下需要linearmodels并加载数据集并尝试运行面板回归。这是他们文档中示例的修改版本。

# Load the data (requires statsmodels and linearmodels)
import statsmodels.api as sm
from linearmodels.datasets import wage_panel
import pandas as pd
data = wage_panel.load()
year = pd.Categorical(data.year)
data = data.set_index(['nr', 'year'])
data['year'] = year
print(wage_panel.DESCR)
print(data.head())

# Run the panel regression
from linearmodels.panel import PanelOLS
exog_vars = ['exper','union','married']
exog = sm.add_constant(data[exog_vars])
mod = PanelOLS(data.lwage, exog, entity_effects=True, time_effects=True)
fe_te_res = mod.fit()
print(fe_te_res)

这给出了以下错误：

AbsorbingEffectError：无法估计模型。所包含的效应已经完全吸收了一个或多个变量。当使用模型中包含的效果完美地解释了一个或多个因变量时，就会发生这种情况。

但是，如果您在 Stata 中通过将相同的数据导出到 Stata 进行估计，则运行：

data.drop(columns='year').to_stata('data.dta')

然后在您的 stata 文件中运行等效文件（加载数据后）：

xtset nr year
xtreg lwage exper union married i.year, fe

这将执行以下操作：

> . xtreg lwage exper union married i.year, fe
note: 1987.year omitted because of collinearity

Fixed-effects (within) regression               Number of obs      =      4360
Group variable: nr                              Number of groups   =       545

R-sq:  within  = 0.1689                         Obs per group: min =         8
       between = 0.0000                                        avg =       8.0
       overall = 0.0486                                        max =         8

                                                F(9,3806)          =     85.95
corr(u_i, Xb)  = -0.1747                        Prob > F           =    0.0000

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       exper |   .0638624   .0032594    19.59   0.000     .0574721    .0702527
       union |   .0833697   .0194393     4.29   0.000     .0452572    .1214821
     married |   .0583372   .0183688     3.18   0.002     .0223235    .0943509
             |
        year |
       1981  |   .0496865   .0200714     2.48   0.013     .0103348    .0890382
       1982  |   .0399445    .019123     2.09   0.037     .0024521    .0774369
       1983  |   .0193513    .018662     1.04   0.300    -.0172373    .0559398
       1984  |   .0229574   .0186503     1.23   0.218    -.0136081    .0595229
       1985  |   .0081499   .0191359     0.43   0.670    -.0293677    .0456674
       1986  |   .0036329   .0200851     0.18   0.856    -.0357456    .0430115
       1987  |          0  (omitted)
             |
       _cons |   1.169184   .0231221    50.57   0.000     1.123851    1.214517
-------------+----------------------------------------------------------------
     sigma_u |  .40761229
     sigma_e |  .35343397
         rho |  .57083029   (fraction of variance due to u_i)
------------------------------------------------------------------------------

请注意，stata 从回归中任意删除了 1987 年，但仍然运行。有没有办法在linearmodelsor中获得类似的功能statsmodels？

score 2 · Accepted Answer

我能想到的唯一方法是手动。

样本数据

import pandas as pd
import numpy as np
import statsmodels.api as sm

from linearmodels.datasets import wage_panel
from linearmodels.panel import PanelOLS

data = wage_panel.load()

首先，我们将追随 Stata 的脚步，为每年的固定效应生成虚拟变量，并省略第一个值，按字典顺序排序（由drop_first=True参数完成）。使用np.unique然后获取标签很重要，因为这也是如此。无需statsmodels添加常量，只需自己做：

data = wage_panel.load()
data = pd.concat([data, pd.get_dummies(data.year, drop_first=True)], axis=1)
exog_vars = ['exper','union','married'] + np.unique(data.year)[1::].tolist()
data = data.set_index(['nr', 'year'])

exog = data[exog_vars].assign(constant=1)

如果我们尝试运行这个模型，它会因为完美的共线性而失败。因为我们正在做一个内部回归，所以我们不能只测试 exog 的共线性，我们需要首先在面板内去均值，因为去均值的自变量的共线性很重要。我将在这里复制一份，以免exog我们在最终回归中使用它。

exog2 = exog.copy()
exog2 = exog2 - exog2.groupby(level=0).transform('mean') + exog2.mean()

我们现在可以看到存在完美的共线性；当我们针对所有其他变量对一个内贬值 exog 变量进行回归时，我们会得到一些回归的完美 R 平方：

for col in exog2:
    print(col, sm.OLS(exog2[col], exog2.drop(columns=col)).fit().rsquared)

#exper 1.0
#union 0.004326532427527674
#married 0.18901677578724896
#1981 1.0
#1982 1.0
#1983 1.0
#1984 1.0
#1985 1.0
#1986 1.0
#1987 1.0

现在，Stata 或其他软件包如何决定删除哪个变量对我来说是个谜。在这种情况下，您可能会倾向于将您的年度虚拟变量中的一个放在其他变量上。让我们选择 1987，这样我们最终可以得到与 Stata 相同的结果。

exog2 = exog2.drop(columns=1987)
for col in exog2:
    print(col, sm.OLS(exog2[col], exog2.drop(columns=col)).fit().rsquared)

#exper 0.48631
#union 0.0043265
#married 0.1890
#1981 0.34978
#1982 0.28369
#1983 0.2478680
#1984 0.2469180
#1985 0.2846552
#1986 0.35067075
#constant 0.94641

所以我们已经处理了共线性并且可以回到模型。由于我们手动包含了年度固定效应，我们可以time_effects从模型中删除。

mod = PanelOLS(data.lwage, exog.drop(columns=1987), entity_effects=True, time_effects=False)
print(mod.fit())

                          PanelOLS Estimation Summary                           
================================================================================
Dep. Variable:                  lwage   R-squared:                        0.1689
Estimator:                   PanelOLS   R-squared (Between):             -0.0882
No. Observations:                4360   R-squared (Within):               0.1689
Date:                Sat, Mar 09 2019   R-squared (Overall):              0.0308
Time:                        00:59:14   Log-likelihood                   -1355.7
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                      85.946
Entities:                         545   P-value                           0.0000
Avg Obs:                       8.0000   Distribution:                  F(9,3806)
Min Obs:                       8.0000                                           
Max Obs:                       8.0000   F-statistic (robust):             85.946
                                        P-value                           0.0000
Time periods:                       8   Distribution:                  F(9,3806)
Avg Obs:                       545.00                                           
Min Obs:                       545.00                                           
Max Obs:                       545.00                                           

                             Parameter Estimates                              
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
exper          0.0639     0.0033     19.593     0.0000      0.0575      0.0703
union          0.0834     0.0194     4.2887     0.0000      0.0453      0.1215
married        0.0583     0.0184     3.1759     0.0015      0.0223      0.0944
1981           0.0497     0.0201     2.4755     0.0133      0.0103      0.0890
1982           0.0399     0.0191     2.0888     0.0368      0.0025      0.0774
1983           0.0194     0.0187     1.0369     0.2998     -0.0172      0.0559
1984           0.0230     0.0187     1.2309     0.2184     -0.0136      0.0595
1985           0.0081     0.0191     0.4259     0.6702     -0.0294      0.0457
1986           0.0036     0.0201     0.1809     0.8565     -0.0357      0.0430
constant       1.1692     0.0231     50.566     0.0000      1.1239      1.2145
==============================================================================

与Stata输出相匹配。没有任何理由放弃 1987，我们本可以选择其他东西，但至少这样我们可以看到结果与 xtreg 匹配。

python - Python中面板数据中的多重共线性

1 回答 1

样本数据

Related

Reference