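I am trying to replicate Stata output in R. I am using the affairs dataset. I cannot replicate the probit function with robust standard errors.

The Stata code looks like this: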

probit affair male age yrsmarr kids relig educ ratemarr, r

I started with:

probit1 <- glm(affair ~ male + age + yrsmarr + kids + relig + educ + ratemarr,
               family = binomial(link = "probit"), data = mydata)

Then I tried various adjustments with the sandwich package, for example:

myProbit <- function(probit1, vcov = sandwich(..., adjust = TRUE)) {
            print(coeftest(probit1, vcov = sandwich(probit1, adjust = TRUE)))
}

or (with every type from HC0 to HC5):

myProbit <- function(probit1, vcov = sandwich) {
            print(coeftest(probit1, vcovHC(probit1, type = "HC0")))
}

or, as suggested here (do I have to pass something different for object?):

sandwich1 <- function(object, ...) sandwich(object) * nobs(object) / (nobs(object) - 1)
coeftest(probit1, vcov = sandwich1)

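None of these attempts produced the same standard errors or z values as the Stata output.

Constructive suggestions would be appreciated!

Thanks in advance!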

3 Answers

For anyone considering jumping on this bandwagon, here is some code that demonstrates the problem (data are here):

clear
set more off
capture ssc install bcuse
capture ssc install rsource
bcuse affairs

saveold affairs, version(12) replace

rsource, terminator(XXX)
  library("foreign")
  library("lmtest")
  library("sandwich")
  mydata<-read.dta("affairs.dta")
  probit1<-glm(affair ~ male + age + yrsmarr + kids + relig + educ + ratemarr, family = binomial (link = "probit"), data = mydata)
  sandwich1 <- function(object,...) sandwich(object) * nobs(object)/(nobs(object) - 1)
  coeftest(probit1,vcov = sandwich1)
XXX 

probit affair male age yrsmarr kids relig educ ratemarr, robust cformat(%9.6f) nolog

R gives:

z test of coefficients:

             Estimate Std. Error z value  Pr(>|z|)    
(Intercept)  0.764157   0.546692  1.3978 0.1621780    
male         0.188816   0.133260  1.4169 0.1565119    
age         -0.024400   0.011423 -2.1361 0.0326725 *  
yrsmarr      0.054608   0.019025  2.8703 0.0041014 ** 
kids         0.208072   0.168222  1.2369 0.2161261    
relig       -0.186085   0.053968 -3.4480 0.0005647 ***
educ         0.015506   0.026389  0.5876 0.5568012    
ratemarr    -0.272711   0.053668 -5.0814 3.746e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Stata yields:

Probit regression                               Number of obs     =        601
                                                Wald chi2(7)      =      54.93
                                                Prob > chi2       =     0.0000
Log pseudolikelihood =  -305.2525               Pseudo R2         =     0.0961

------------------------------------------------------------------------------
             |               Robust
      affair |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   0.188817   0.131927     1.43   0.152    -0.069755    0.447390
         age |  -0.024400   0.011124    -2.19   0.028    -0.046202   -0.002597
     yrsmarr |   0.054608   0.018963     2.88   0.004     0.017441    0.091775
        kids |   0.208075   0.166243     1.25   0.211    -0.117754    0.533905
       relig |  -0.186085   0.053240    -3.50   0.000    -0.290435   -0.081736
        educ |   0.015505   0.026355     0.59   0.556    -0.036150    0.067161
    ratemarr |  -0.272710   0.053392    -5.11   0.000    -0.377356   -0.168064
       _cons |   0.764160   0.534335     1.43   0.153    -0.283117    1.811437
------------------------------------------------------------------------------
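
To put a number on the gap between the two outputs above, here is a minimal sketch that uses only the robust standard errors copied from them (in R's coefficient order, intercept first). The vector names are made up for illustration.

```r
# Robust standard errors transcribed from the two outputs above.
# Order: (Intercept), male, age, yrsmarr, kids, relig, educ, ratemarr.
se_r_glm <- c(0.546692, 0.133260, 0.011423, 0.019025,
              0.168222, 0.053968, 0.026389, 0.053668)
se_stata <- c(0.534335, 0.131927, 0.011124, 0.018963,
              0.166243, 0.053240, 0.026355, 0.053392)

# Ratio of the R (glm + sandwich) value to the Stata (probit, robust) value.
ratio <- se_r_glm / se_stata
names(ratio) <- c("(Intercept)", "male", "age", "yrsmarr",
                  "kids", "relig", "educ", "ratemarr")
print(round(ratio, 4))
```

Every ratio is slightly above 1, so the glm-based standard errors are uniformly a little larger than Stata's rather than scattered around them.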

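Addendum:

The difference in the estimated covariance of the coefficients is due to the different fitting algorithms. In R, the glm command uses iteratively reweighted least squares (IRLS), whereas Stata's probit uses an ML method based on the Newton-Raphson algorithm. You can match what R does in Stata by using glm with the irls option: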

glm affair male age yrsmarr kids relig educ ratemarr, irls family(binomial) link(probit) robust

This produces:

Generalized linear models                         No. of obs      =        601
Optimization     : MQL Fisher scoring             Residual df     =        593
                   (IRLS EIM)                     Scale parameter =          1
Deviance         =  610.5049916                   (1/df) Deviance =   1.029519
Pearson          =  619.0405832                   (1/df) Pearson  =   1.043913

Variance function: V(u) = u*(1-u)                 [Bernoulli]
Link function    : g(u) = invnorm(u)              [Probit]

                                                  BIC             =  -3183.862

------------------------------------------------------------------------------
             |             Semirobust
      affair |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   0.188817   0.133260     1.42   0.157    -0.072367    0.450002
         age |  -0.024400   0.011422    -2.14   0.033    -0.046787   -0.002012
     yrsmarr |   0.054608   0.019025     2.87   0.004     0.017319    0.091897
        kids |   0.208075   0.168222     1.24   0.216    -0.121634    0.537785
       relig |  -0.186085   0.053968    -3.45   0.001    -0.291862   -0.080309
        educ |   0.015505   0.026389     0.59   0.557    -0.036216    0.067226
    ratemarr |  -0.272710   0.053668    -5.08   0.000    -0.377898   -0.167522
       _cons |   0.764160   0.546693     1.40   0.162    -0.307338    1.835657
------------------------------------------------------------------------------

These will be close, though not identical. I don't know how to get R to use something like NR without a lot of work.
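
For what it is worth, a hand-rolled Newton-Raphson probit is not very long in base R. The sketch below uses simulated data (not the affairs dataset) and hypothetical variable names. It assumes that the practical difference between the two routes is which curvature matrix ends up as the "bread" of the sandwich: the observed Hessian from Newton-Raphson, or the expected information from IRLS. That reading is an assumption on my part, not something the outputs above prove.

```r
# Sketch only: simulated data, not the affairs dataset.
set.seed(42)
n <- 601
x <- cbind(1, rnorm(n), rbinom(n, 1, 0.5))        # hypothetical design matrix
y <- as.numeric(drop(x %*% c(-0.5, 0.6, 0.4)) + rnorm(n) > 0)
q <- 2 * y - 1                                     # +1 / -1 coding of the outcome

# Score with respect to the linear predictor, one value per observation.
score_eta <- function(eta) q * dnorm(q * eta) / pnorm(q * eta)

# Newton-Raphson on the probit log-likelihood.
beta <- rep(0, ncol(x))
for (iter in 1:100) {
  eta  <- drop(x %*% beta)
  lam  <- score_eta(eta)
  grad <- crossprod(x, lam)                        # gradient
  hess <- -crossprod(x, x * (lam * (lam + eta)))   # observed Hessian
  step <- drop(solve(hess, grad))
  beta <- beta - step
  if (max(abs(step)) < 1e-10) break
}

# Recompute the scores at the final estimate.
eta  <- drop(x %*% beta)
lam  <- score_eta(eta)
meat <- crossprod(x * lam) * n / (n - 1)           # n/(n-1) small-sample factor

# Bread 1: inverse of the negative observed Hessian (Newton-Raphson style).
bread_obs <- solve(crossprod(x, x * (lam * (lam + eta))))
# Bread 2: inverse expected information (what IRLS / Fisher scoring yields).
w_exp     <- dnorm(eta)^2 / (pnorm(eta) * pnorm(-eta))
bread_exp <- solve(crossprod(x, x * w_exp))

se_obs <- sqrt(diag(bread_obs %*% meat %*% bread_obs))
se_exp <- sqrt(diag(bread_exp %*% meat %*% bread_exp))

# glm() should reproduce the coefficients and the expected-information bread.
fit <- glm(y ~ x - 1, family = binomial(link = "probit"))
print(cbind(newton = beta, glm = coef(fit)))
print(cbind(se_obs, se_exp))
```

Both fits should arrive at essentially the same coefficients, so any gap between se_obs and se_exp comes from the bread matrix alone, not from where the optimizer stopped.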

answered 2015-05-14T23:23:42.623

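I am matching the R results to Stata using the matrix approach described in detail here (p. 57). However, I have not yet been able to match the results exactly. I think the small differences may be due to differences in the scores: the R scores match Stata's only up to 4 decimal places.

Stata: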
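
Written out, the matrix approach amounts to the formula below. The notation is my own shorthand and is not taken from the cited page: \widehat{V} is the model-based covariance (e(V) in Stata, vcov() in R), x_i is the i-th row of the design matrix including the constant, \hat{u}_i is the probit score of observation i with respect to the linear predictor, and n = 601.

```latex
\widehat{V}_{\text{robust}}
  = \widehat{V}\,
    \left( \frac{n}{n-1} \sum_{i=1}^{n} \hat{u}_i^{2}\, x_i x_i^{\top} \right)
    \widehat{V},
\qquad
\hat{u}_i
  = \frac{\phi(x_i^{\top}\hat{\beta})\,\bigl(y_i - \Phi(x_i^{\top}\hat{\beta})\bigr)}
         {\Phi(x_i^{\top}\hat{\beta})\,\bigl(1 - \Phi(x_i^{\top}\hat{\beta})\bigr)}
```

The middle term is what matrix accum with iweight builds in Stata, and the outer \widehat{V} factors are the pre- and post-multiplication by the model-based covariance.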

clear all
bcuse affairs

probit affair male age yrsmarr kids relig educ ratemarr
mat var_nr=e(V)
predict double u, score
matrix accum s = male age yrsmarr kids relig educ ratemarr [iweight=u^2*601/600] //n=601,n-1=600
matrix rv = var_nr*s*var_nr
mat diagrv=vecdiag(rv)
matmap diagrv rse,m(sqrt(@)) //install matmap 
mat list rse //standard errors

This gives you the same standard errors as:

qui probit affair male age yrsmarr kids relig educ ratemarr,r



rse[1,8]
       affair:    affair:    affair:    affair:    affair:    affair:    affair:    affair:
         male        age    yrsmarr       kids      relig       educ   ratemarr      _cons
r1  .13192707  .01112372  .01896336  .16624258  .05324046  .02635524  .05339163  .53433495

R:

library(AER) # Affairs data; AER also attaches sandwich, which provides estfun()
data(Affairs)
mydata <- Affairs
mydata$affairs <- with(mydata, ifelse(affairs > 0, 1, affairs)) # convert to 1 and 0
probit1 <- glm(affairs ~ gender + age + yearsmarried + children + religiousness + education + rating,
               family = binomial(link = "probit"), data = mydata)
n <- nobs(probit1) # 601
u <- estfun(probit1)[, "(Intercept)"] # scores: match Stata to 4 decimals; the difference may be due to this step
iweight <- u^2 * n / (n - 1) # n/(n-1) correction; matches Stata to 4 decimals
x <- model.matrix(probit1)
s <- crossprod(x, x * iweight) # same as t(x) %*% diag(iweight) %*% x; doesn't match Stata
rv <- vcov(probit1) %*% s %*% vcov(probit1)
rse <- sqrt(diag(rv)) # standard errors
   rse
  (Intercept)    gendermale           age  yearsmarried   childrenyes religiousness     education        rating 
   0.54669177    0.13325951    0.01142258    0.01902537    0.16822161    0.05396841    0.02638902    0.05366828 

This is the same as:

sandwich1 <- function(object, ...) sandwich(object) * nobs(object) / (nobs(object) - 1)
coeftest(probit1, vcov = sandwich1)

Conclusion: the difference in results between R and Stata is due to differences in the scores (they match only up to 4 decimal places).

answered 2015-05-16T03:33:00.587

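To close out this discussion: the original Stata output can be matched in R by estimating with sampleSelection::probit and computing the robust standard errors with the sandwich package (I used version 2.5). That probit function uses maximum likelihood, as does its Stata counterpart.

As in the original post, the Stata code is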

probit affair male age yrsmarr kids relig educ ratemarr, robust

which gives

------------------------------------------------------------------------------
             |               Robust
      affair |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .1888175   .1319271     1.43   0.152    -.0697548    .4473898
         age |  -.0243996   .0111237    -2.19   0.028    -.0462017   -.0025975
     yrsmarr |    .054608   .0189634     2.88   0.004     .0174405    .0917755
        kids |   .2080754   .1662426     1.25   0.211     -.117754    .5339049
       relig |  -.1860854   .0532405    -3.50   0.000    -.2904348    -.081736
        educ |   .0155052   .0263552     0.59   0.556    -.0361501    .0671605
    ratemarr |  -.2727101   .0533916    -5.11   0.000    -.3773558   -.1680644
       _cons |     .76416    .534335     1.43   0.153    -.2831173    1.811437
------------------------------------------------------------------------------

The R code that gives the same result is

library(AER)
library(sampleSelection)
data(Affairs)
Affairs$affair = Affairs$affairs > 0
Affairs$male = Affairs$gender == 'male'
reg = probit(affair ~ male + age + yearsmarried + children + religiousness +
           education + rating, data=Affairs)
print(coeftest(reg, vcovCL), digits=6)

which gives

                Estimate Std. Error  t value   Pr(>|t|)    
(Intercept)    0.7641600  0.5343350  1.43011  0.1532109    
maleTRUE       0.1888175  0.1319271  1.43123  0.1528921    
age           -0.0243996  0.0111237 -2.19347  0.0286608 *  
yearsmarried   0.0546080  0.0189634  2.87966  0.0041248 ** 
childrenyes    0.2080755  0.1662426  1.25164  0.2111955    
religiousness -0.1860854  0.0532405 -3.49519  0.0005091 ***
education      0.0155052  0.0263552  0.58832  0.5565446    
rating        -0.2727101  0.0533916 -5.10773 4.4012e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

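With these functions, both programs compute maximum likelihood probit estimates, and both compute robust standard errors. By the way: kudos to the authors of the sandwich package, which (IMO) really cleaned up standard error calculations in R.

answered 2018-08-24T14:44:28.273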