我在 R(时间和横截面)中有一个面板数据集,并且想计算按二维聚类的标准误差,因为我的残差是双向相关的。谷歌搜索我发现http://thetarzan.wordpress.com/2011/06/11/clustered-standard-errors-in-r/它提供了一个功能来做到这一点。这似乎有点特别,所以我想知道是否有一个包已经过测试并且这样做了吗?
我知道sandwich
HAC 标准错误,但它不做双重聚类(即沿二维)。
我在 R(时间和横截面)中有一个面板数据集,并且想计算按二维聚类的标准误差,因为我的残差是双向相关的。谷歌搜索我发现http://thetarzan.wordpress.com/2011/06/11/clustered-standard-errors-in-r/它提供了一个功能来做到这一点。这似乎有点特别,所以我想知道是否有一个包已经过测试并且这样做了吗?
我知道sandwich
HAC 标准错误,但它不做双重聚类(即沿二维)。
这是一个老问题。但是看到人们似乎仍在登陆它,我想我会提供一些现代方法来在 R 中进行多路聚类:
fixest::feols()
library(fixest)
nlswork = haven::read_dta("http://www.stata-press.com/data/r14/nlswork.dta")
est_feols = feols(ln_wage ~ age | race + year, cluster = ~race+year, data = nlswork)
est_feols
## An important feature of fixest: We can _instantaneously_ compute other
## VCOV matrices / SEs on the fly with summary.fixest(). No need to re-run
## the model!
summary(est_feols, se = 'standard') ## vanilla SEs
summary(est_feols, se = 'hetero') ## robust SEs
summary(est_feols, se = 'twoway') ## diff syntax, but same as original model
summary(est_feols, cluster = c('race', 'year')) ## ditto
summary(est_feols, cluster = ~race^year) ## interacted cluster vars
summary(est_feols, cluster = ~ race + year + idcode) ## add third cluster var (not in original model call)
## etc.
lfe::felm()
library(lfe)
## Note the third "| 0 " slot means we're not using IV
est_felm = felm(ln_wage ~ age | race + year | 0 | race + year, data = nlswork)
summary(est_felm)
sandwich
library(sandwich)
library(lmtest)
est_sandwich = lm(ln_wage ~ age + factor(race) + factor(year), data = nlswork)
coeftest(est_sandwich, vcov = vcovCL, cluster = ~ race + year)
Aaaand,只是为了强调关于速度的观点。这是三种不同方法的基准(使用两个固定的 FE 和双向聚类)。
est_feols = function() feols(ln_wage ~ age | race + year, cluster = ~race+year, data = nlswork)
est_felm = function() felm(ln_wage ~ age | race + year | 0 | race + year, data = nlswork)
est_standwich = function() {coeftest(lm(ln_wage ~ age + factor(race) + factor(year), data = nlswork),
vcov = vcovCL, cluster = ~ race + year)}
microbenchmark(est_feols(), est_felm(), est_standwich(), times = 3)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> est_feols() 11.94122 11.96158 12.55835 11.98193 12.86692 13.75191 3 a
#> est_felm() 87.18064 95.89905 100.69589 104.61746 107.45352 110.28957 3 b
#> est_standwich() 176.43502 183.95964 188.50271 191.48425 194.53656 197.58886 3 c
对于面板回归,该plm
软件包可以沿两个维度估计聚类 SE。
require(foreign)
require(plm)
require(lmtest)
test <- read.dta("http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.dta")
##Double-clustering formula (Thompson, 2011)
vcovDC <- function(x, ...){
vcovHC(x, cluster="group", ...) + vcovHC(x, cluster="time", ...) -
vcovHC(x, method="white1", ...)
}
fpm <- plm(y ~ x, test, model='pooling', index=c('firmid', 'year'))
因此,现在您可以获得集群 SE:
##Clustered by *group*
> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.029680 0.066952 0.4433 0.6576
x 1.034833 0.050550 20.4714 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##Clustered by *time*
> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="time", type="HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.029680 0.022189 1.3376 0.1811
x 1.034833 0.031679 32.6666 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##Clustered by *group* and *time*
> coeftest(fpm, vcov=function(x) vcovDC(x, type="HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.029680 0.064580 0.4596 0.6458
x 1.034833 0.052465 19.7243 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
有关更多详细信息,请参阅:
但是,仅当您的数据可以强制转换为pdata.frame
. 如果你有它会失败"duplicate couples (time-id)"
。在这种情况下,您仍然可以聚类,但只能沿着一个维度。
通过仅指定一个索引来欺骗plm
您认为您拥有正确的面板数据集:
fpm.tr <- plm(y ~ x, test, model='pooling', index=c('firmid'))
因此,现在您可以获得集群 SE:
##Clustered by *group*
> coeftest(fpm.tr, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.029680 0.066952 0.4433 0.6576
x 1.034833 0.050550 20.4714 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
您还可以使用此解决方法按更高维度或更高级别(例如industry
或country
)进行聚类。但是,在这种情况下,您将无法使用group
(or time
) effects
,这是该方法的主要限制。
另一种适用于面板和其他类型数据的方法是multiwayvcov
包。它允许双重聚类,但也允许在更高维度进行聚类。根据包的网站,这是对 Arai 代码的改进:
- 由于缺失而丢失的观察结果的透明处理
- 全多路(或n路,或n维,或多维)聚类
使用彼得森数据和cluster.vcov()
:
library("lmtest")
library("multiwayvcov")
data(petersen)
m1 <- lm(y ~ x, data = petersen)
coeftest(m1, vcov=function(x) cluster.vcov(x, petersen[ , c("firmid", "year")]))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.029680 0.065066 0.4561 0.6483
## x 1.034833 0.053561 19.3206 <2e-16 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Arai 的函数可用于对标准错误进行聚类。他有另一个用于多维度聚类的版本:
mcl <- function(dat,fm, cluster1, cluster2){
attach(dat, warn.conflicts = F)
library(sandwich);library(lmtest)
cluster12 = paste(cluster1,cluster2, sep="")
M1 <- length(unique(cluster1))
M2 <- length(unique(cluster2))
M12 <- length(unique(cluster12))
N <- length(cluster1)
K <- fm$rank
dfc1 <- (M1/(M1-1))*((N-1)/(N-K))
dfc2 <- (M2/(M2-1))*((N-1)/(N-K))
dfc12 <- (M12/(M12-1))*((N-1)/(N-K))
u1j <- apply(estfun(fm), 2, function(x) tapply(x, cluster1, sum))
u2j <- apply(estfun(fm), 2, function(x) tapply(x, cluster2, sum))
u12j <- apply(estfun(fm), 2, function(x) tapply(x, cluster12, sum))
vc1 <- dfc1*sandwich(fm, meat=crossprod(u1j)/N )
vc2 <- dfc2*sandwich(fm, meat=crossprod(u2j)/N )
vc12 <- dfc12*sandwich(fm, meat=crossprod(u12j)/N)
vcovMCL <- vc1 + vc2 - vc12
coeftest(fm, vcovMCL)}
有关参考和使用示例,请参见:
Frank Harrell 的包rms
(以前称为Design
)有一个我在聚类时经常使用的功能:robcov
.
例如,参见 的这一部分?robcov
。
cluster: a variable indicating groupings. ‘cluster’ may be any type of
vector (factor, character, integer). NAs are not allowed.
Unique values of ‘cluster’ indicate possibly correlated
groupings of observations. Note the data used in the fit and
stored in ‘fit$x’ and ‘fit$y’ may have had observations
containing missing values deleted. It is assumed that if any
NAs were removed during the original model fitting, an
‘naresid’ function exists to restore NAs so that the rows of
the score matrix coincide with ‘cluster’. If ‘cluster’ is
omitted, it defaults to the integers 1,2,...,n to obtain the
"sandwich" robust covariance matrix estimate.