r - 面板数据的双聚类标准误

Question

我在 R（时间和横截面）中有一个面板数据集，并且想计算按二维聚类的标准误差，因为我的残差是双向相关的。谷歌搜索我发现http://thetarzan.wordpress.com/2011/06/11/clustered-standard-errors-in-r/它提供了一个功能来做到这一点。这似乎有点特别，所以我想知道是否有一个包已经过测试并且这样做了吗？

我知道sandwichHAC 标准错误，但它不做双重聚类（即沿二维）。

score 10 · Accepted Answer

这是一个老问题。但是看到人们似乎仍在登陆它，我想我会提供一些现代方法来在 R 中进行多路聚类：

选项 1（最快）：`fixest::feols()`

library(fixest)

nlswork = haven::read_dta("http://www.stata-press.com/data/r14/nlswork.dta")

est_feols = feols(ln_wage ~ age | race + year, cluster = ~race+year, data = nlswork)
est_feols

## An important feature of fixest: We can _instantaneously_ compute other
## VCOV matrices / SEs on the fly with summary.fixest(). No need to re-run
## the model!
summary(est_feols, se = 'standard') ## vanilla SEs
summary(est_feols, se = 'hetero') ## robust SEs
summary(est_feols, se = 'twoway') ## diff syntax, but same as original model
summary(est_feols, cluster = c('race', 'year')) ## ditto
summary(est_feols, cluster = ~race^year) ## interacted cluster vars
summary(est_feols, cluster = ~ race + year + idcode)  ## add third cluster var (not in original model call)
## etc.

选项 2（快速）：`lfe::felm()`

library(lfe)

## Note the third "| 0 " slot means we're not using IV

est_felm = felm(ln_wage ~ age | race + year | 0 | race + year, data = nlswork)
summary(est_felm)

选项 3（较慢，但灵活）：`sandwich`

library(sandwich)
library(lmtest)

est_sandwich = lm(ln_wage ~ age + factor(race) + factor(year), data = nlswork) 
coeftest(est_sandwich, vcov = vcovCL, cluster = ~ race + year)

基准

Aaaand，只是为了强调关于速度的观点。这是三种不同方法的基准（使用两个固定的 FE 和双向聚类）。

est_feols = function() feols(ln_wage ~ age | race + year, cluster = ~race+year, data = nlswork) 
est_felm = function() felm(ln_wage ~ age | race + year | 0 | race + year, data = nlswork)
est_standwich = function() {coeftest(lm(ln_wage ~ age + factor(race) + factor(year), data = nlswork), 
                                     vcov = vcovCL, cluster = ~ race + year)}

microbenchmark(est_feols(), est_felm(), est_standwich(), times = 3)

#> Unit: milliseconds
#>             expr       min        lq      mean    median        uq       max neval cld
#>      est_feols()  11.94122  11.96158  12.55835  11.98193  12.86692  13.75191     3 a  
#>       est_felm()  87.18064  95.89905 100.69589 104.61746 107.45352 110.28957     3  b 
#>  est_standwich() 176.43502 183.95964 188.50271 191.48425 194.53656 197.58886     3   c

score 7 · Accepted Answer

对于面板回归，该plm软件包可以沿两个维度估计聚类 SE。

使用M. Petersen 的基准测试结果：

require(foreign)
require(plm)
require(lmtest)
test <- read.dta("http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.dta")

##Double-clustering formula (Thompson, 2011)
vcovDC <- function(x, ...){
    vcovHC(x, cluster="group", ...) + vcovHC(x, cluster="time", ...) - 
        vcovHC(x, method="white1", ...)
}

fpm <- plm(y ~ x, test, model='pooling', index=c('firmid', 'year'))

因此，现在您可以获得集群 SE：

##Clustered by *group*
> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.066952  0.4433   0.6576    
x           1.034833   0.050550 20.4714   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

##Clustered by *time*
> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="time", type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.022189  1.3376   0.1811    
x           1.034833   0.031679 32.6666   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

##Clustered by *group* and *time*
> coeftest(fpm, vcov=function(x) vcovDC(x, type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.064580  0.4596   0.6458    
x           1.034833   0.052465 19.7243   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

有关更多详细信息，请参阅：

R 中的 Fama-MacBeth 和 Cluster-Robust（按公司和时间）标准错误。

但是，仅当您的数据可以强制转换为pdata.frame. 如果你有它会失败"duplicate couples (time-id)"。在这种情况下，您仍然可以聚类，但只能沿着一个维度。

通过仅指定一个索引来欺骗plm您认为您拥有正确的面板数据集：

fpm.tr <- plm(y ~ x, test, model='pooling', index=c('firmid'))

因此，现在您可以获得集群 SE：

##Clustered by *group*
> coeftest(fpm.tr, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.066952  0.4433   0.6576    
x           1.034833   0.050550 20.4714   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

您还可以使用此解决方法按更高维度或更高级别（例如industry或country）进行聚类。但是，在这种情况下，您将无法使用group(or time) effects，这是该方法的主要限制。

另一种适用于面板和其他类型数据的方法是multiwayvcov包。它允许双重聚类，但也允许在更高维度进行聚类。根据包的网站，这是对 Arai 代码的改进：

由于缺失而丢失的观察结果的透明处理

全多路（或n路，或n维，或多维）聚类

使用彼得森数据和cluster.vcov()：

library("lmtest")
library("multiwayvcov")

data(petersen)
m1 <- lm(y ~ x, data = petersen)

coeftest(m1, vcov=function(x) cluster.vcov(x, petersen[ , c("firmid", "year")]))
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.029680   0.065066  0.4561   0.6483    
## x           1.034833   0.053561 19.3206   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

score 5 · Accepted Answer

Arai 的函数可用于对标准错误进行聚类。他有另一个用于多维度聚类的版本：

mcl <- function(dat,fm, cluster1, cluster2){
          attach(dat, warn.conflicts = F)
          library(sandwich);library(lmtest)
          cluster12 = paste(cluster1,cluster2, sep="")
          M1  <- length(unique(cluster1))
          M2  <- length(unique(cluster2))   
          M12 <- length(unique(cluster12))
          N   <- length(cluster1)          
          K   <- fm$rank             
          dfc1  <- (M1/(M1-1))*((N-1)/(N-K))  
          dfc2  <- (M2/(M2-1))*((N-1)/(N-K))  
          dfc12 <- (M12/(M12-1))*((N-1)/(N-K))  
          u1j   <- apply(estfun(fm), 2, function(x) tapply(x, cluster1,  sum)) 
          u2j   <- apply(estfun(fm), 2, function(x) tapply(x, cluster2,  sum)) 
          u12j  <- apply(estfun(fm), 2, function(x) tapply(x, cluster12, sum)) 
          vc1   <-  dfc1*sandwich(fm, meat=crossprod(u1j)/N )
          vc2   <-  dfc2*sandwich(fm, meat=crossprod(u2j)/N )
          vc12  <- dfc12*sandwich(fm, meat=crossprod(u12j)/N)
          vcovMCL <- vc1 + vc2 - vc12
          coeftest(fm, vcovMCL)}

有关参考和使用示例，请参见：

R中的聚集标准错误

score 4 · Accepted Answer

Frank Harrell 的包rms（以前称为Design）有一个我在聚类时经常使用的功能：robcov.

例如，参见的这一部分?robcov。

cluster: a variable indicating groupings. ‘cluster’ may be any type of
      vector (factor, character, integer).  NAs are not allowed.
      Unique values of ‘cluster’ indicate possibly correlated
      groupings of observations. Note the data used in the fit and
      stored in ‘fit$x’ and ‘fit$y’ may have had observations
      containing missing values deleted. It is assumed that if any
      NAs were removed during the original model fitting, an
      ‘naresid’ function exists to restore NAs so that the rows of
      the score matrix coincide with ‘cluster’. If ‘cluster’ is
      omitted, it defaults to the integers 1,2,...,n to obtain the
      "sandwich" robust covariance matrix estimate.

r - 面板数据的双聚类标准误

4 回答 4

选项 1（最快）：fixest::feols()

选项 2（快速）：lfe::felm()

选项 3（较慢，但灵活）：sandwich

基准

Related

Reference

选项 1（最快）：`fixest::feols()`

选项 2（快速）：`lfe::felm()`

选项 3（较慢，但灵活）：`sandwich`